New submissions for Mon, 12 Feb 24

Open DongZhouGu opened this issue 1 year ago • 0 comments

Keyword: human object interaction

There is no result

Keyword: visual relation detection

There is no result

Keyword: object detection

A versatile robotic hand with 3D perception, force sensing for autonomous manipulation

  • Authors: Nikolaus Correll, Dylan Kriegman, Stephen Otto, James Watson
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2402.06018
  • Pdf link: https://arxiv.org/pdf/2402.06018
  • Abstract We describe a force-controlled robotic gripper with built-in tactile and 3D perception. We also describe a complete autonomous manipulation pipeline consisting of object detection, segmentation, point cloud processing, force-controlled manipulation, and symbolic (re)-planning. The design emphasizes versatility in terms of applications, manufacturability, use of commercial off-the-shelf parts, and open-source software. We validate the design by characterizing force control (achieving up to 32N, controllable in steps of 0.08N), force measurement, and two manipulation demonstrations: assembly of the Siemens gear assembly problem, and a sensor-based stacking task requiring replanning. These demonstrate robust execution of long sequences of sensor-based manipulation tasks, which makes the resulting platform a solid foundation for researchers in task-and-motion planning, educators, and quick prototyping of household, industrial and warehouse automation tasks.

SWITCH: An Exemplar for Evaluating Self-Adaptive ML-Enabled Systems

  • Authors: Arya Marda, Shubham Kulkarni, Karthik Vaidhyanathan
  • Subjects: Software Engineering (cs.SE)
  • Arxiv link: https://arxiv.org/abs/2402.06351
  • Pdf link: https://arxiv.org/pdf/2402.06351
  • Abstract Addressing runtime uncertainties in Machine Learning-Enabled Systems (MLS) is crucial for maintaining Quality of Service (QoS). The Machine Learning Model Balancer is a concept that addresses these uncertainties by facilitating dynamic ML model switching, showing promise in improving QoS in MLS. Leveraging this concept, this paper introduces SWITCH, an exemplar developed to enhance self-adaptive capabilities in such systems through dynamic model switching in runtime. SWITCH is designed as a comprehensive web service catering to a broad range of ML scenarios, with its implementation demonstrated through an object detection use case. SWITCH provides researchers with a flexible platform to apply and evaluate their ML model switching strategies, aiming to enhance QoS in MLS. SWITCH features advanced input handling, real-time data processing, and logging for adaptation metrics supplemented with an interactive real-time dashboard for enhancing system observability. This paper details SWITCH's architecture, self-adaptation strategies through ML model switching, and its empirical validation through a case study, illustrating its potential to improve QoS in MLS. By enabling a hands-on approach to explore adaptive behaviors in ML systems, SWITCH contributes a valuable tool to the SEAMS community for research into self-adaptive mechanisms for MLS and their practical applications.

Keyword: transformer

Todyformer: Towards Holistic Dynamic Graph Transformers with Structure-Aware Tokenization

  • Authors: Mahdi Biparva, Raika Karimi, Faezeh Faez, Yingxue Zhang
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2402.05944
  • Pdf link: https://arxiv.org/pdf/2402.05944
  • Abstract Temporal Graph Neural Networks have garnered substantial attention for their capacity to model evolving structural and temporal patterns while exhibiting impressive performance. However, it is known that these architectures are encumbered by issues that constrain their performance, such as over-squashing and over-smoothing. Meanwhile, Transformers have demonstrated exceptional computational capacity to effectively address challenges related to long-range dependencies. Consequently, we introduce Todyformer, a novel Transformer-based neural network tailored for dynamic graphs. It unifies the local encoding capacity of Message-Passing Neural Networks (MPNNs) with the global encoding of Transformers through i) a novel patchifying paradigm for dynamic graphs to alleviate over-squashing, ii) a structure-aware parametric tokenization strategy leveraging MPNNs, iii) a Transformer with temporal positional-encoding to capture long-range dependencies, and iv) an encoding architecture that alternates between local and global contextualization, mitigating over-smoothing in MPNNs. Experimental evaluations on public benchmark datasets demonstrate that Todyformer consistently outperforms the state-of-the-art methods for downstream tasks. Furthermore, we illustrate the underlying aspects of the proposed model in effectively capturing extensive temporal dependencies in dynamic graphs.

A Hyper-Transformer model for Controllable Pareto Front Learning with Split Feasibility Constraints

  • Authors: Tran Anh Tuan, Nguyen Viet Dung, Tran Ngoc Thang
  • Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
  • Arxiv link: https://arxiv.org/abs/2402.05955
  • Pdf link: https://arxiv.org/pdf/2402.05955
  • Abstract Controllable Pareto front learning (CPFL) approximates the Pareto solution set and then locates a Pareto optimal solution with respect to a given reference vector. However, in practice decision-maker objectives are limited to a constraint region, so instead of training on the entire decision space, we train only on the constraint region. Controllable Pareto front learning with Split Feasibility Constraints (SFC) is a way to find the best Pareto solutions to a split multi-objective optimization problem that meets certain constraints. In the previous study, CPFL used a Hypernetwork model comprising multi-layer perceptron (Hyper-MLP) blocks. With the substantial advancement of transformer architecture in deep learning, transformers can outperform other architectures in various tasks. Therefore, we have developed a hyper-transformer (Hyper-Trans) model for CPFL with SFC. We use the theory of universal approximation for the sequence-to-sequence function to show that the Hyper-Trans model makes MED errors smaller in computational experiments than the Hyper-MLP model.

Pathformer: Multi-scale transformers with Adaptive Pathways for Time Series Forecasting

  • Authors: Peng Chen, Yingying Zhang, Yunyao Cheng, Yang Shu, Yihang Wang, Qingsong Wen, Bin Yang, Chenjuan Guo
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2402.05956
  • Pdf link: https://arxiv.org/pdf/2402.05956
  • Abstract Transformer-based models have achieved some success in time series forecasting. Existing methods mainly model time series from limited or fixed scales, making it challenging to capture different characteristics spanning various scales. In this paper, we propose multi-scale transformers with adaptive pathways (Pathformer). The proposed Transformer integrates both temporal resolution and temporal distance for multi-scale modeling. Multi-scale division divides the time series into different temporal resolutions using patches of various sizes. Based on the division of each scale, dual attention is performed over these patches to capture global correlations and local details as temporal dependencies. We further enrich the multi-scale transformer with adaptive pathways, which adaptively adjust the multi-scale modeling process based on the varying temporal dynamics in the input time series, improving the prediction accuracy and generalization of Pathformer. Extensive experiments on eleven real-world datasets demonstrate that Pathformer not only achieves state-of-the-art performance by surpassing all current models but also exhibits stronger generalization abilities under various transfer scenarios.
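
The multi-scale division step described in the Pathformer abstract (splitting one series into patches of several sizes) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `divide_into_patches` and `multi_scale_division` are hypothetical, and dropping a trailing remainder that does not fill a whole patch is an assumption (the paper may pad instead).

```python
def divide_into_patches(series, patch_size):
    """Split a 1-D series into consecutive, non-overlapping patches.

    Assumption: trailing values that do not fill a whole patch are dropped.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    usable = len(series) - len(series) % patch_size
    return [series[i:i + patch_size] for i in range(0, usable, patch_size)]


def multi_scale_division(series, patch_sizes):
    """Return one patch list per patch size (one temporal resolution each)."""
    return {p: divide_into_patches(series, p) for p in patch_sizes}


series = list(range(12))
scales = multi_scale_division(series, (2, 3, 4))
for size, patches in scales.items():
    print(size, len(patches), patches[0])
```

Smaller patch sizes give more, finer-grained patches; larger ones give fewer, coarser patches. Attention within and across those patches is outside the scope of this sketch.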

A Survey on Transformer Compression

  • Authors: Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao
  • Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2402.05964
  • Pdf link: https://arxiv.org/pdf/2402.05964
  • Abstract Large models based on the Transformer architecture play increasingly vital roles in artificial intelligence, particularly within the realms of natural language processing (NLP) and computer vision (CV). Model compression methods reduce their memory and computational cost, which is a necessary step to implement the transformer models on practical devices. Given the unique architecture of the transformer, featuring alternating attention and Feedforward Neural Network (FFN) modules, specific compression techniques are required. The efficiency of these compression methods is also paramount, as it is usually impractical to retrain large models on the entire training dataset. This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to transformer models. The compression methods are primarily categorized into pruning, quantization, knowledge distillation, and efficient architecture design. In each category, we discuss compression methods for both CV and NLP tasks, highlighting common underlying principles. Finally, we delve into the relation between various compression methods, and discuss the further directions in this domain.

The last Dance : Robust backdoor attack via diffusion models and bayesian approach

  • Authors: Orson Mengara
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2402.05967
  • Pdf link: https://arxiv.org/pdf/2402.05967
  • Abstract Diffusion models are state-of-the-art deep learning generative models that are trained on the principle of learning forward and backward diffusion processes via the progressive addition of noise and denoising. In this paper, we seek to trick audio-based DNN models, such as those in the Hugging Face framework, for example, those that focus on audio, in particular transformer-based artificial intelligence models, which are powerful machine learning models that save time and deliver faster, more efficient results. We demonstrate the feasibility of backdoor attacks (called BacKBayDiffMod) on audio transformers derived from Hugging Face, a popular framework in the world of artificial intelligence (AI) research. The backdoor attack developed in this paper is based on poisoning the model's training data by incorporating backdoor diffusion sampling and a Bayesian approach to the distribution of poisoned data.

Breaking Symmetry When Training Transformers

  • Authors: Chunsheng Zuo, Michael Guerzhoy
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2402.05969
  • Pdf link: https://arxiv.org/pdf/2402.05969
  • Abstract As we show in this paper, the prediction for output token $n+1$ of Transformer architectures without one of the mechanisms of positional encodings and causal attention is invariant to permutations of input tokens $1, 2, ..., n-1$. Usually, both mechanisms are employed and the symmetry with respect to the input tokens is broken. Recently, it has been shown that one can train Transformers without positional encodings. This must be enabled by the causal attention mechanism. In this paper, we elaborate on the argument that the causal connection mechanism must be responsible for the fact that Transformers are able to model input sequences where the order is important. Vertical "slices" of Transformers are all encouraged to represent the same location $k$ in the input sequence. We hypothesize that residual connections contribute to this phenomenon, and demonstrate evidence for this.
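
The symmetry claim in this abstract can be checked numerically on a toy model. The sketch below is not the authors' code. It assumes a two-layer, single-head attention stack with identity projections, no positional encodings, and no residual connections or MLPs; all names are hypothetical. It compares the last position's output before and after reordering every token except the last one.

```python
import math


def attention_layer(xs, causal):
    """One dot-product self-attention layer with identity Q/K/V projections."""
    out = []
    for i, q in enumerate(xs):
        # Under a causal mask, position i only sees positions 0..i.
        visible = xs[: i + 1] if causal else xs
        scores = [sum(a * b for a, b in zip(q, k)) for k in visible]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, visible)) / total
            for d in range(len(q))
        ])
    return out


def last_output(xs, causal, layers=2):
    """Output vector at the final position after `layers` attention layers."""
    for _ in range(layers):
        xs = attention_layer(xs, causal)
    return xs[-1]


def max_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Six deterministic 4-dimensional token embeddings (arbitrary toy values).
tokens = [[math.sin(4 * i + d + 1) for d in range(4)] for i in range(6)]
# Reverse the order of tokens 1..n-1 and keep the last token in place.
permuted = tokens[-2::-1] + [tokens[-1]]

diff_full = max_diff(last_output(tokens, False), last_output(permuted, False))
diff_causal = max_diff(last_output(tokens, True), last_output(permuted, True))
print("no causal mask:", diff_full)
print("causal mask:   ", diff_causal)
```

Without the causal mask the two outputs should agree up to floating-point rounding, while with the mask they generally differ. Two layers are used because with a single layer the last position attends to every token either way, so the mask alone would not change its output.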

Memory-Efficient Vision Transformers: An Activation-Aware Mixed-Rank Compression Strategy

  • Authors: Seyedarmin Azizi, Mahdi Nazemi, Massoud Pedram
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2402.06004
  • Pdf link: https://arxiv.org/pdf/2402.06004
  • Abstract As Vision Transformers (ViTs) increasingly set new benchmarks in computer vision, their practical deployment on inference engines is often hindered by their significant memory bandwidth and (on-chip) memory footprint requirements. This paper addresses this memory limitation by introducing an activation-aware model compression methodology that uses selective low-rank weight tensor approximations of different layers to reduce the parameter count of ViTs. The key idea is to decompose the weight tensors into a sum of two parameter-efficient tensors while minimizing the error between the product of the input activations with the original weight tensor and the product of the input activations with the approximate tensor sum. This approximation is further refined by adopting an efficient layer-wise error compensation technique that uses the gradient of the layer's output loss. The combination of these techniques achieves excellent results while it avoids being trapped in a shallow local minimum early in the optimization process and strikes a good balance between the model compression and output accuracy. Notably, the presented method significantly reduces the parameter count of DeiT-B by 60% with less than 1% accuracy drop on the ImageNet dataset, overcoming the usual accuracy degradation seen in low-rank approximations. In addition to this, the presented compression technique can compress large DeiT/ViT models to have about the same model size as smaller DeiT/ViT variants while yielding up to 1.8% accuracy gain. These results highlight the efficacy of our approach, presenting a viable solution for embedding ViTs in memory-constrained environments without compromising their performance.

AI enhanced data assimilation and uncertainty quantification applied to Geological Carbon Storage

  • Authors: G. S. Seabra (1, 2), N. T. Mücke (3, 4), V. L. S. Silva (2, 5), D. Voskov (1, 6), F. Vossepoel (1) ((1) TU Delft, Netherlands, (2) Petrobras, Brazil, (3) Centrum Wiskunde & Informatica, Netherlands, (4) Utrecht University, Netherlands, (5) Imperial College London, United Kingdom, (6) Stanford University, USA)
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2402.06110
  • Pdf link: https://arxiv.org/pdf/2402.06110
  • Abstract This study investigates the integration of machine learning (ML) and data assimilation (DA) techniques, focusing on implementing surrogate models for Geological Carbon Storage (GCS) projects while maintaining high fidelity physical results in posterior states. Initially, we evaluate the surrogate modeling capability of two distinct machine learning models, Fourier Neural Operators (FNOs) and Transformer UNet (T-UNet), in the context of CO$_2$ injection simulations within channelized reservoirs. We introduce the Surrogate-based hybrid ESMDA (SH-ESMDA), an adaptation of the traditional Ensemble Smoother with Multiple Data Assimilation (ESMDA). This method uses FNOs and T-UNet as surrogate models and has the potential to make the standard ESMDA process at least 50% faster, depending on the number of assimilation steps. Additionally, we introduce Surrogate-based Hybrid RML (SH-RML), a variational data assimilation approach that relies on the randomized maximum likelihood (RML) where both the FNO and the T-UNet enable the computation of gradients for the optimization of the objective function, and a high-fidelity model is employed for the computation of the posterior states. Our comparative analyses show that SH-RML offers better uncertainty quantification compared to conventional ESMDA for the case study.

Jointly Learning Representations for Map Entities via Heterogeneous Graph Contrastive Learning

  • Authors: Jiawei Jiang, Yifan Yang, Jingyuan Wang, Junjie Wu
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2402.06135
  • Pdf link: https://arxiv.org/pdf/2402.06135
  • Abstract The electronic map plays a crucial role in geographic information systems, serving various urban managerial scenarios and daily life services. Developing effective Map Entity Representation Learning (MERL) methods is crucial to extracting embedding information from electronic maps and converting map entities into representation vectors for downstream applications. However, existing MERL methods typically focus on one specific category of map entities, such as POIs, road segments, or land parcels, which is insufficient for real-world diverse map-based applications and might lose latent structural and semantic information interacting between entities of different types. Moreover, using representations generated by separate models for different map entities can introduce inconsistencies. Motivated by this, we propose a novel method named HOME-GCL for learning representations of multiple categories of map entities. Our approach utilizes a heterogeneous map entity graph (HOME graph) that integrates both road segments and land parcels into a unified framework. A HOME encoder with parcel-segment joint feature encoding and heterogeneous graph transformer is then deliberately designed to convert segments and parcels into representation vectors. Moreover, we introduce two types of contrastive learning tasks, namely intra-entity and inter-entity tasks, to train the encoder in a self-supervised manner. Extensive experiments on three large-scale datasets covering road segment-based, land parcel-based, and trajectory-based tasks demonstrate the superiority of our approach. To the best of our knowledge, HOME-GCL is the first attempt to jointly learn representations for road segments and land parcels using a unified model.

A self-supervised framework for learning whole slide representations

  • Authors: Xinhai Hou, Cheng Jiang, Akhil Kondepudi, Yiwei Lyu, Asadur Zaman Chowdury, Honglak Lee, Todd C. Hollon
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2402.06188
  • Pdf link: https://arxiv.org/pdf/2402.06188
  • Abstract Whole slide imaging is fundamental to biomedical microscopy and computational pathology. However, whole slide images (WSIs) present a complex computer vision challenge due to their gigapixel size, diverse histopathologic features, spatial heterogeneity, and limited/absent data annotations. These challenges highlight that supervised training alone can result in suboptimal whole slide representations. Self-supervised representation learning can achieve high-quality WSI visual feature learning for downstream diagnostic tasks, such as cancer diagnosis or molecular genetic prediction. Here, we present a general self-supervised whole slide learning (S3L) framework for gigapixel-scale self-supervision of WSIs. S3L combines data transformation strategies from transformer-based vision and language modeling into a single unified framework to generate paired views for self-supervision. S3L leverages the inherent regional heterogeneity, histologic feature variability, and information redundancy within WSIs to learn high-quality whole-slide representations. We benchmark S3L visual representations on two diagnostic tasks for two biomedical microscopy modalities. S3L significantly outperforms WSI baselines for cancer diagnosis and genetic mutation prediction. Additionally, S3L achieves good performance using both in-domain and out-of-distribution patch encoders, demonstrating good flexibility and generalizability.

Masked LoGoNet: Fast and Accurate 3D Image Analysis for Medical Domain

  • Authors: Amin Karimi Monsefi, Payam Karisani, Mengxi Zhou, Stacey Choi, Nathan Doble, Heng Ji, Srinivasan Parthasarathy, Rajiv Ramnath
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2402.06190
  • Pdf link: https://arxiv.org/pdf/2402.06190
  • Abstract Standard modern machine-learning-based imaging methods have faced challenges in medical applications due to the high cost of dataset construction and, thereby, the limited labeled training data available. Additionally, upon deployment, these methods are usually used to process a large volume of data on a daily basis, imposing a high maintenance cost on medical facilities. In this paper, we introduce a new neural network architecture, termed LoGoNet, with a tailored self-supervised learning (SSL) method to mitigate such challenges. LoGoNet integrates a novel feature extractor within a U-shaped architecture, leveraging Large Kernel Attention (LKA) and a dual encoding strategy to capture both long-range and short-range feature dependencies adeptly. This is in contrast to existing methods that rely on increasing network capacity to enhance feature extraction. This combination of novel techniques in our model is especially beneficial in medical image segmentation, given the difficulty of learning intricate and often irregular body organ shapes, such as the spleen. In addition, we propose a novel SSL method tailored for 3D images to compensate for the lack of large labeled datasets. The method combines masking and contrastive learning techniques within a multi-task learning framework and is compatible with both Vision Transformer (ViT) and CNN-based models. We demonstrate the efficacy of our methods in numerous tasks across two standard datasets (i.e., BTCV and MSD). Benchmark comparisons with eight state-of-the-art models highlight LoGoNet's superior performance in both inference time and accuracy.

TEE4EHR: Transformer Event Encoder for Better Representation Learning in Electronic Health Records

  • Authors: Hojjat Karami, David Atienza, Anisoara Ionescu
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2402.06367
  • Pdf link: https://arxiv.org/pdf/2402.06367
  • Abstract Irregular sampling of time series in electronic health records (EHRs) is one of the main challenges for developing machine learning models. Additionally, the pattern of missing data in certain clinical variables is not at random but depends on the decisions of clinicians and the state of the patient. Point process is a mathematical framework for analyzing event sequence data that is consistent with irregular sampling patterns. Our model, TEE4EHR, is a transformer event encoder (TEE) with point process loss that encodes the pattern of laboratory tests in EHRs. The utility of our TEE has been investigated in a variety of benchmark event sequence datasets. Additionally, we conduct experiments on two real-world EHR databases to provide a more comprehensive evaluation of our model. Firstly, in a self-supervised learning approach, the TEE is jointly learned with an existing attention-based deep neural network which gives superior performance in negative log-likelihood and future event prediction. Besides, we propose an algorithm for aggregating attention weights that can reveal the interaction between the events. Secondly, we transfer and freeze the learned TEE to the downstream task for the outcome prediction, where it outperforms state-of-the-art models for handling irregularly sampled time series. Furthermore, our results demonstrate that our approach can improve representation learning in EHRs and can be useful for clinical prediction tasks.

Hierarchical Transformers are Efficient Meta-Reinforcement Learners

  • Authors: Gresa Shala, André Biedenkapp, Josif Grabocka
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2402.06402
  • Pdf link: https://arxiv.org/pdf/2402.06402
  • Abstract We introduce Hierarchical Transformers for Meta-Reinforcement Learning (HTrMRL), a powerful online meta-reinforcement learning approach. HTrMRL aims to address the challenge of enabling reinforcement learning agents to perform effectively in previously unseen tasks. We demonstrate how past episodes serve as a rich source of information, which our model effectively distills and applies to new contexts. Our learned algorithm is capable of outperforming the previous state-of-the-art and provides more efficient meta-training while significantly improving generalization capabilities. Experimental results, obtained across various simulated tasks of the Meta-World Benchmark, indicate a significant improvement in learning efficiency and adaptability compared to the state-of-the-art on a variety of tasks. Our approach not only enhances the agent's ability to generalize from limited data but also paves the way for more robust and versatile AI systems.

Trust the Process: Zero-Knowledge Machine Learning to Enhance Trust in Generative AI Interactions

  • Authors: Bianca-Mihaela Ganescu, Jonathan Passerat-Palmbach
  • Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/2402.06414
  • Pdf link: https://arxiv.org/pdf/2402.06414
  • Abstract Generative AI, exemplified by models like transformers, has opened up new possibilities in various domains but also raised concerns about fairness, transparency and reliability, especially in fields like medicine and law. This paper emphasizes the urgency of ensuring fairness and quality in these domains through generative AI. It explores using cryptographic techniques, particularly Zero-Knowledge Proofs (ZKPs), to address concerns regarding performance fairness and accuracy while protecting model privacy. Applying ZKPs to Machine Learning models, known as ZKML (Zero-Knowledge Machine Learning), enables independent validation of AI-generated content without revealing sensitive model information, promoting transparency and trust. ZKML enhances AI fairness by providing cryptographic audit trails for model predictions and ensuring uniform performance across users. We introduce snarkGPT, a practical ZKML implementation for transformers, to empower users to verify output accuracy and quality while preserving model privacy. We present a series of empirical results studying snarkGPT's scalability and performance to assess the feasibility and challenges of adopting a ZKML-powered approach to capture quality and performance fairness problems in generative AI models.

CurveFormer++: 3D Lane Detection by Curve Propagation with Temporal Curve Queries and Attention

  • Authors: Yifeng Bai, Zhirong Chen, Pengpeng Liang, Erkang Cheng
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2402.06423
  • Pdf link: https://arxiv.org/pdf/2402.06423
  • Abstract In autonomous driving, 3D lane detection using monocular cameras is an important task for various downstream planning and control tasks. Recent CNN and Transformer approaches usually apply a two-stage scheme in the model design. The first stage transforms the image feature from a front image into a bird's-eye-view (BEV) representation. Subsequently, a sub-network processes the BEV feature map to generate the 3D detection results. However, these approaches heavily rely on a challenging image feature transformation module from a perspective view to a BEV representation. In our work, we present CurveFormer++, a single-stage Transformer-based method that does not require the image feature view transform module and directly infers 3D lane detection results from the perspective image features. Specifically, our approach models the 3D detection task as a curve propagation problem, where each lane is represented by a curve query with a dynamic and ordered anchor point set. By employing a Transformer decoder, the model can iteratively refine the 3D lane detection results. A curve cross-attention module is introduced in the Transformer decoder to calculate similarities between image features and curve queries of lanes. To handle varying lane lengths, we employ context sampling and anchor point restriction techniques to compute more relevant image features for a curve query. Furthermore, we apply a temporal fusion module that incorporates selected informative sparse curve queries and their corresponding anchor point sets to leverage historical lane information. In the experiments, we evaluate our approach for the 3D lane detection task on two publicly available real-world datasets. The results demonstrate that our method provides outstanding performance compared with both CNN and Transformer based methods. We also conduct ablation studies to analyze the impact of each component in our approach.

Inducing Systematicity in Transformers by Attending to Structurally Quantized Embeddings

  • Authors: Yichen Jiang, Xiang Zhou, Mohit Bansal
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2402.06492
  • Pdf link: https://arxiv.org/pdf/2402.06492
  • Abstract Transformers generalize to novel compositions of structures and entities after being trained on a complex dataset, but easily overfit on datasets of insufficient complexity. We observe that when the training set is sufficiently complex, the model encodes sentences that have a common syntactic structure using a systematic attention pattern. Inspired by this observation, we propose SQ-Transformer (Structurally Quantized) that explicitly encourages systematicity in the embeddings and attention layers, even with a training set of low complexity. At the embedding level, we introduce Structure-oriented Vector Quantization (SoVQ) to cluster word embeddings into several classes of structurally equivalent entities. At the attention level, we devise the Systematic Attention Layer (SAL) and an alternative, Systematically Regularized Layer (SRL) that operate on the quantized word embeddings so that sentences of the same structure are encoded with invariant or similar attention patterns. Empirically, we show that SQ-Transformer achieves stronger compositional generalization than the vanilla Transformer on multiple low-complexity semantic parsing and machine translation datasets. In our analysis, we show that SoVQ indeed learns a syntactically clustered embedding space and SAL/SRL induces generalizable attention patterns, which lead to improved systematicity.

Distilling Morphology-Conditioned Hypernetworks for Efficient Universal Morphology Control

  • Authors: Zheng Xiong, Risto Vuorio, Jacob Beck, Matthieu Zimmer, Kun Shao, Shimon Whiteson
  • Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2402.06570
  • Pdf link: https://arxiv.org/pdf/2402.06570
  • Abstract Learning a universal policy across different robot morphologies can significantly improve learning efficiency and enable zero-shot generalization to unseen morphologies. However, learning a highly performant universal policy requires sophisticated architectures like transformers (TF) that have larger memory and computational cost than simpler multi-layer perceptrons (MLP). To achieve both good performance like TF and high efficiency like MLP at inference time, we propose HyperDistill, which consists of: (1) A morphology-conditioned hypernetwork (HN) that generates robot-wise MLP policies, and (2) A policy distillation approach that is essential for successful training. We show that on UNIMAL, a benchmark with hundreds of diverse morphologies, HyperDistill performs as well as a universal TF teacher policy on both training and unseen test robots, but reduces model size by 6-14 times, and computational cost by 67-160 times in different environments. Our analysis attributes the efficiency advantage of HyperDistill at inference time to knowledge decoupling, i.e., the ability to decouple inter-task and intra-task knowledge, a general principle that could also be applied to improve inference efficiency in other domains.

Keyword: scene understanding

There is no result

Keyword: visual reasoning

There is no result

DongZhouGu · Feb 12 '24 02:02