arxiv-updates icon indicating copy to clipboard operation
arxiv-updates copied to clipboard

New submissions for Fri, 10 Nov 23

Open zoq opened this issue 1 year ago • 0 comments

Keyword: sgd

There is no result

Keyword: optimization

Is one brick enough to break the wall of spoken dialogue state tracking?

  • Authors: Authors: Lucas Druart (LIA), Valentin Vielzeuf, Yannick Estève (LIA)
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2311.04923
  • Pdf link: https://arxiv.org/pdf/2311.04923
  • Abstract In Task-Oriented Dialogue (TOD) systems, correctly updating the system's understanding of the user's needs (a.k.a dialogue state tracking) is key to a smooth interaction. Traditionally, TOD systems perform this update in three steps: transcription of the user's utterance, semantic extraction of the key concepts, and contextualization with the previously identified concepts. Such cascade approaches suffer from cascading errors and separate optimization. End-to-End approaches have been proved helpful up to the semantic extraction step. This paper goes one step further paving the path towards completely neural spoken dialogue state tracking by comparing three approaches: (1) a state of the art cascade approach, (2) a locally E2E approach with rule-based contextualization and (3) a completely neural approach. Our study highlights that although they all outperform the recent DSTC11 best model, especially with a filtering post-processing step, (1) remains the most accurate approach. Indeed, both (2) and (3) have trouble propagating context as dialogues unfold showing that context propagation in completely neural approaches is an open challenge.

Edge-assisted U-Shaped Split Federated Learning with Privacy-preserving for Internet of Things

  • Authors: Authors: Hengliang Tang, Zihang Zhao, Detian Liu, Yang Cao, Shiqiang Zhang, Siqing You
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/2311.04944
  • Pdf link: https://arxiv.org/pdf/2311.04944
  • Abstract In the realm of the Internet of Things (IoT), deploying deep learning models to process data generated or collected by IoT devices is a critical challenge. However, direct data transmission can cause network congestion and inefficient execution, given that IoT devices typically lack computation and communication capabilities. Centralized data processing in data centers is also no longer feasible due to concerns over data privacy and security. To address these challenges, we present an innovative Edge-assisted U-Shaped Split Federated Learning (EUSFL) framework, which harnesses the high-performance capabilities of edge servers to assist IoT devices in model training and optimization process. In this framework, we leverage Federated Learning (FL) to enable data holders to collaboratively train models without sharing their data, thereby enhancing data privacy protection by transmitting only model parameters. Additionally, inspired by Split Learning (SL), we split the neural network into three parts using U-shaped splitting for local training on IoT devices. By exploiting the greater computation capability of edge servers, our framework effectively reduces overall training time and allows IoT devices with varying capabilities to perform training tasks efficiently. Furthermore, we proposed a novel noise mechanism called LabelDP to ensure that data features and labels can securely resist reconstruction attacks, eliminating the risk of privacy leakage. Our theoretical analysis and experimental results demonstrate that EUSFL can be integrated with various aggregation algorithms, maintaining good performance across different computing capabilities of IoT devices, and significantly reducing training time and local computation overhead.

Leveraging Speculative Sampling and KV-Cache Optimizations Together for Generative AI using OpenVINO

  • Authors: Authors: Haim Barad, Ekaterina Aidova, Yury Gorbachev
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
  • Arxiv link: https://arxiv.org/abs/2311.04951
  • Pdf link: https://arxiv.org/pdf/2311.04951
  • Abstract Inference optimizations are critical for improving user experience and reducing infrastructure costs and power consumption. In this article, we illustrate a form of dynamic execution known as speculative sampling to reduce the overall latency of text generation and compare it with standard autoregressive sampling. This can be used together with model-based optimizations (e.g. quantization) to provide an optimized solution. Both sampling methods make use of KV caching. A Jupyter notebook and some sample executions are provided.

DEMASQ: Unmasking the ChatGPT Wordsmith

  • Authors: Authors: Kavita Kumari, Alessandro Pegoraro, Hossein Fereidooni, Ahmad-Reza Sadeghi
  • Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05019
  • Pdf link: https://arxiv.org/pdf/2311.05019
  • Abstract The potential misuse of ChatGPT and other Large Language Models (LLMs) has raised concerns regarding the dissemination of false information, plagiarism, academic dishonesty, and fraudulent activities. Consequently, distinguishing between AI-generated and human-generated content has emerged as an intriguing research topic. However, current text detection methods lack precision and are often restricted to specific tasks or domains, making them inadequate for identifying content generated by ChatGPT. In this paper, we propose an effective ChatGPT detector named DEMASQ, which accurately identifies ChatGPT-generated content. Our method addresses two critical factors: (i) the distinct biases in text composition observed in human- and machine-generated content and (ii) the alterations made by humans to evade previous detection methods. DEMASQ is an energy-based detection model that incorporates novel aspects, such as (i) optimization inspired by the Doppler effect to capture the interdependence between input text embeddings and output labels, and (ii) the use of explainable AI techniques to generate diverse perturbations. To evaluate our detector, we create a benchmark dataset comprising a mixture of prompts from both ChatGPT and humans, encompassing domains such as medical, open Q&A, finance, wiki, and Reddit. Our evaluation demonstrates that DEMASQ achieves high accuracy in identifying content generated by ChatGPT.

Geometry-Calibrated DRO: Combating Over-Pessimism with Free Energy Implications

  • Authors: Authors: Jiashuo Liu, Jiayun Wu, Tianyu Wang, Hao Zou, Bo Li, Peng Cui
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2311.05054
  • Pdf link: https://arxiv.org/pdf/2311.05054
  • Abstract Machine learning algorithms minimizing average risk are susceptible to distributional shifts. Distributionally Robust Optimization (DRO) addresses this issue by optimizing the worst-case risk within an uncertainty set. However, DRO suffers from over-pessimism, leading to low-confidence predictions, poor parameter estimations as well as poor generalization. In this work, we conduct a theoretical analysis of a probable root cause of over-pessimism: excessive focus on noisy samples. To alleviate the impact of noise, we incorporate data geometry into calibration terms in DRO, resulting in our novel Geometry-Calibrated DRO (GCDRO) for regression. We establish the connection between our risk objective and the Helmholtz free energy in statistical physics, and this free-energy-based risk can extend to standard DRO methods. Leveraging gradient flow in Wasserstein space, we develop an approximate minimax optimization algorithm with a bounded error ratio and elucidate how our approach mitigates noisy sample effects. Comprehensive experiments confirm GCDRO's superiority over conventional DRO methods.

Accelerating Exploration with Unlabeled Prior Data

  • Authors: Authors: Qiyang Li, Jason Zhang, Dibya Ghosh, Amy Zhang, Sergey Levine
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2311.05067
  • Pdf link: https://arxiv.org/pdf/2311.05067
  • Abstract Learning to solve tasks from a sparse reward signal is a major challenge for standard reinforcement learning (RL) algorithms. However, in the real world, agents rarely need to solve sparse reward tasks entirely from scratch. More often, we might possess prior experience to draw on that provides considerable guidance about which actions and outcomes are possible in the world, which we can use to explore more effectively for new tasks. In this work, we study how prior data without reward labels may be used to guide and accelerate exploration for an agent solving a new sparse reward task. We propose a simple approach that learns a reward model from online experience, labels the unlabeled prior data with optimistic rewards, and then uses it concurrently alongside the online data for downstream policy and critic optimization. This general formula leads to rapid exploration in several challenging sparse-reward domains where tabula rasa exploration is insufficient, including the AntMaze domain, Adroit hand manipulation domain, and a visual simulated robotic manipulation domain. Our results highlight the ease of incorporating unlabeled prior data into existing online RL algorithms, and the (perhaps surprising) effectiveness of doing so.

On neural and dimensional collapse in supervised and unsupervised contrastive learning with hard negative sampling

  • Authors: Authors: Ruijie Jiang, Thuan Nguyen, Shuchin Aeron, Prakash Ishwar
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05139
  • Pdf link: https://arxiv.org/pdf/2311.05139
  • Abstract For a widely-studied data model and general loss and sample-hardening functions we prove that the Supervised Contrastive Learning (SCL), Hard-SCL (HSCL), and Unsupervised Contrastive Learning (UCL) risks are minimized by representations that exhibit Neural Collapse (NC), i.e., the class means form an Equianglular Tight Frame (ETF) and data from the same class are mapped to the same representation. We also prove that for any representation mapping, the HSCL and Hard-UCL (HUCL) risks are lower bounded by the corresponding SCL and UCL risks. Although the optimality of ETF is known for SCL, albeit only for InfoNCE loss, its optimality for HSCL and UCL under general loss and hardening functions is novel. Moreover, our proofs are much simpler, compact, and transparent. We empirically demonstrate, for the first time, that ADAM optimization of HSCL and HUCL risks with random initialization and suitable hardness levels can indeed converge to the NC geometry if we incorporate unit-ball or unit-sphere feature normalization. Without incorporating hard negatives or feature normalization, however, the representations learned via ADAM suffer from dimensional collapse (DC) and fail to attain the NC geometry.

Challenges and Opportunities in the Co-design of Convolutions and RISC-V Vector Processors

  • Authors: Authors: Sonia Rani Gupta, Nikela Papadopoulou, Miquel Pericàs
  • Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
  • Arxiv link: https://arxiv.org/abs/2311.05284
  • Pdf link: https://arxiv.org/pdf/2311.05284
  • Abstract The RISC-V "V" extension introduces vector processing to the RISC-V architecture. Unlike most SIMD extensions, it supports long vectors which can result in significant improvement of multiple applications. In this paper, we present our ongoing research to implement and optimize a vectorized Winograd algorithm used in convolutional layers on RISC-V Vector(RISC-VV) processors. Our study identifies effective techniques for optimizing the kernels of Winograd on RISC-VV using intrinsic instructions, and showcases how certain instructions offer better performance. Our co-design findings suggest that the Winograd algorithm benefits from vector lengths up to 2048 bits and cache sizes up to 64MB. We use our experience with Winograd to highlight potential enhancements for the standard that would simplify code generation and aid low-level programming. Finally, we share our experience from experimenting with forks of gem5 for RISC-VV and stress the importance of a mature software ecosystem, to facilitate design space exploration and architectural optimization.

VoxNeRF: Bridging Voxel Representation and Neural Radiance Fields for Enhanced Indoor View Synthesis

  • Authors: Authors: Sen Wang, Wei Zhang, Stefano Gasperini, Shun-Cheng Wu, Nassir Navab
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2311.05289
  • Pdf link: https://arxiv.org/pdf/2311.05289
  • Abstract Creating high-quality view synthesis is essential for immersive applications but continues to be problematic, particularly in indoor environments and for real-time deployment. Current techniques frequently require extensive computational time for both training and rendering, and often produce less-than-ideal 3D representations due to inadequate geometric structuring. To overcome this, we introduce VoxNeRF, a novel approach that leverages volumetric representations to enhance the quality and efficiency of indoor view synthesis. Firstly, VoxNeRF constructs a structured scene geometry and converts it into a voxel-based representation. We employ multi-resolution hash grids to adaptively capture spatial features, effectively managing occlusions and the intricate geometry of indoor scenes. Secondly, we propose a unique voxel-guided efficient sampling technique. This innovation selectively focuses computational resources on the most relevant portions of ray segments, substantially reducing optimization time. We validate our approach against three public indoor datasets and demonstrate that VoxNeRF outperforms state-of-the-art methods. Remarkably, it achieves these gains while reducing both training and rendering times, surpassing even Instant-NGP in speed and bringing the technology closer to real-time.

Reliable and Efficient Data Collection in UAV-based IoT Networks

  • Authors: Authors: Poorvi Joshi (1), Alakesh Kalita (2), Mohan Gurusamy (1) ((1) National University of Singapore, (2) Singapore University of Technology and Design)
  • Subjects: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2311.05303
  • Pdf link: https://arxiv.org/pdf/2311.05303
  • Abstract Internet of Things (IoT) involves sensors for monitoring and wireless networks for efficient communication. However, resource-constrained IoT devices and limitations in existing wireless technologies hinder its full potential. Integrating Unmanned Aerial Vehicles (UAVs) into IoT networks can address some challenges by expanding its' coverage, providing security, and bringing computing closer to IoT devices. Nevertheless, effective data collection in UAV-assisted IoT networks is hampered by factors, including dynamic UAV behavior, environmental variables, connectivity instability, and security considerations. In this survey, we first explore UAV-based IoT networks, focusing on communication and networking aspects. Next, we cover various UAV-based data collection methods their advantages and disadvantages, followed by a discussion on performance metrics for data collection. As this article primarily emphasizes reliable and efficient data collection in UAV-assisted IoT networks, we briefly discuss existing research on data accuracy and consistency, network connectivity, and data security and privacy to provide insights into reliable data collection. Additionally, we discuss efficient data collection strategies in UAV-based IoT networks, covering trajectory and path planning, collision avoidance, sensor network clustering, data aggregation, UAV swarm formations, and artificial intelligence for optimization. We also present two use cases of UAVs as a service for enhancing data collection reliability and efficiency. Finally, we discuss future challenges in data collection for UAV-assisted IoT networks.

Beyond the training set: an intuitive method for detecting distribution shift in model-based optimization

  • Authors: Authors: Farhan Damani, David H Brookes, Theodore Sternlieb, Cameron Webster, Stephen Malina, Rishi Jajoo, Kathy Lin, Sam Sinai
  • Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
  • Arxiv link: https://arxiv.org/abs/2311.05363
  • Pdf link: https://arxiv.org/pdf/2311.05363
  • Abstract Model-based optimization (MBO) is increasingly applied to design problems in science and engineering. A common scenario involves using a fixed training set to train models, with the goal of designing new samples that outperform those present in the training data. A major challenge in this setting is distribution shift, where the distributions of training and design samples are different. While some shift is expected, as the goal is to create better designs, this change can negatively affect model accuracy and subsequently, design quality. Despite the widespread nature of this problem, addressing it demands deep domain knowledge and artful application. To tackle this issue, we propose a straightforward method for design practitioners that detects distribution shifts. This method trains a binary classifier using knowledge of the unlabeled design distribution to separate the training data from the design data. The classifier's logit scores are then used as a proxy measure of distribution shift. We validate our method in a real-world application by running offline MBO and evaluate the effect of distribution shift on design quality. We find that the intensity of the shift in the design distribution varies based on the number of steps taken by the optimization algorithm, and our simple approach can identify these shifts. This enables users to constrain their search to regions where the model's predictions are reliable, thereby increasing the quality of designs.

Joint Angle and Delay Cramér-Rao Bound Optimization for Integrated Sensing and Communications

  • Authors: Authors: Chao Hu, Yuan Fang, Ling Qiu
  • Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2311.05372
  • Pdf link: https://arxiv.org/pdf/2311.05372
  • Abstract In this paper, we study a multi-input multi-output (MIMO) beamforming design in an integrated sensing and communication (ISAC) system, in which an ISAC base station (BS) is used to communicate with multiple downlink users and simultaneously the communication signals are reused for sensing multiple targets. Our interested sensing parameters are the angle and delay information of the targets, which can be used to locate these targets. Under this consideration, we first derive the Cram'{e}r-Rao bound (CRB) for angle and delay estimation. Then, we optimize the transmit beamforming at the BS to minimize the CRB, subject to communication rate and power constraints. In particular, we obtain the optimal solution in closed-form in the case of single-target and single-user, and in the case of multi-target and multi-user scenario, the sparsity of the optimal solution is proven, leading to a reduction in computational complexity during optimization. The numerical results demonstrate that the optimized beamforming yields excellent positioning performance and effectively reduces the requirement for a large number of antennas at the BS.

The Sample Complexity Of ERMs In Stochastic Convex Optimization

  • Authors: Authors: Daniel Carmon, Roi Livni, Amir Yehudayoff
  • Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2311.05398
  • Pdf link: https://arxiv.org/pdf/2311.05398
  • Abstract Stochastic convex optimization is one of the most well-studied models for learning in modern machine learning. Nevertheless, a central fundamental question in this setup remained unresolved: "How many data points must be observed so that any empirical risk minimizer (ERM) shows good performance on the true population?" This question was proposed by Feldman (2016), who proved that $\Omega(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ data points are necessary (where $d$ is the dimension and $\epsilon>0$ is the accuracy parameter). Proving an $\omega(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ lower bound was left as an open problem. In this work we show that in fact $\tilde{O}(\frac{d}{\epsilon}+\frac{1}{\epsilon^2})$ data points are also sufficient. This settles the question and yields a new separation between ERMs and uniform convergence. This sample complexity holds for the classical setup of learning bounded convex Lipschitz functions over the Euclidean unit ball. We further generalize the result and show that a similar upper bound holds for all symmetric convex bodies. The general bound is composed of two terms: (i) a term of the form $\tilde{O}(\frac{d}{\epsilon})$ with an inverse-linear dependence on the accuracy parameter, and (ii) a term that depends on the statistical complexity of the class of $\textit{linear}$ functions (captured by the Rademacher complexity). The proof builds a mechanism for controlling the behavior of stochastic convex optimization problems.

Inference for Probabilistic Dependency Graphs

  • Authors: Authors: Oliver E. Richardson, Joseph Y. Halpern, Christopher De Sa
  • Subjects: Data Structures and Algorithms (cs.DS); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Probability (math.PR)
  • Arxiv link: https://arxiv.org/abs/2311.05580
  • Pdf link: https://arxiv.org/pdf/2311.05580
  • Abstract Probabilistic dependency graphs (PDGs) are a flexible class of probabilistic graphical models, subsuming Bayesian Networks and Factor Graphs. They can also capture inconsistent beliefs, and provide a way of measuring the degree of this inconsistency. We present the first tractable inference algorithm for PDGs with discrete variables, making the asymptotic complexity of PDG inference similar that of the graphical models they generalize. The key components are: (1) the observation that, in many cases, the distribution a PDG specifies can be formulated as a convex optimization problem (with exponential cone constraints), (2) a construction that allows us to express these problems compactly for PDGs of boundeed treewidth, (3) contributions to the theory of PDGs that justify the construction, and (4) an appeal to interior point methods that can solve such problems in polynomial time. We verify the correctness and complexity of our approach, and provide an implementation of it. We then evaluate our implementation, and demonstrate that it outperforms baseline approaches. Our code is available at this http URL

A Coefficient Makes SVRG Effective

  • Authors: Authors: Yida Yin, Zhiqiu Xu, Zhiyuan Li, Trevor Darrell, Zhuang Liu
  • Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2311.05589
  • Pdf link: https://arxiv.org/pdf/2311.05589
  • Abstract Stochastic Variance Reduced Gradient (SVRG), introduced by Johnson & Zhang (2013), is a theoretically compelling optimization method. However, as Defazio & Bottou (2019) highlights, its effectiveness in deep learning is yet to be proven. In this work, we demonstrate the potential of SVRG in optimizing real-world neural networks. Our analysis finds that, for deeper networks, the strength of the variance reduction term in SVRG should be smaller and decrease as training progresses. Inspired by this, we introduce a multiplicative coefficient $\alpha$ to control the strength and adjust it through a linear decay schedule. We name our method $\alpha$-SVRG. Our results show $\alpha$-SVRG better optimizes neural networks, consistently reducing training loss compared to both baseline and the standard SVRG across various architectures and image classification datasets. We hope our findings encourage further exploration into variance reduction techniques in deep learning. Code is available at https://github.com/davidyyd/alpha-SVRG.

FogROS2-Sky: Optimizing Latency and Cost for Multi-Cloud Robot Applications

  • Authors: Authors: Kaiyuan Chen, Kush Hari, Rohil Khare, Charlotte Le, Trinity Chung, Jaimyn Drake, Karthik Dharmarajan, Simeon Adebola, Jeffrey Ichnowski, John Kubiatowicz, Ken Goldberg
  • Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2311.05600
  • Pdf link: https://arxiv.org/pdf/2311.05600
  • Abstract This paper studies the cost-performance tradeoffs in cloud robotics with heterogeneous cloud service providers, which have complex pricing models and varying application requirements. We present FogROS2-Sky, a cost-efficient open source robotics platform that offloads unmodified ROS2 applications to multiple cloud providers and enables fine-grained cost analysis for ROS2 applications' communication with multiple cloud providers. As each provider offers different options for CPU, GPU, memory, and latency, it can be very difficult for users to decide which to choose. FogROS2-Sky includes an optimization algorithm, which either finds the best available hardware specification that fulfills the user's latency and cost constraints or reports that such a specification does not exist. We use FogROS2-Sky to perform time-cost analysis on three robotics applications: visual SLAM, grasp planning, and motion planning. We are able to sample different hardware setups at nearly half the cost while still create cost and latency functions suitable for the optimizer. We also evaluate the optimizer's efficacy for these applications with the Pareto frontier and show that the optimizer selects efficient hardware configurations to balance cost and latency. Videos and code are available on the website https://sites.google.com/view/fogros2-sky

Efficient Parallelization Layouts for Large-Scale Distributed Model Training

  • Authors: Authors: Johannes Hagemann, Samuel Weinbach, Konstantin Dobler, Maximilian Schall, Gerard de Melo
  • Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
  • Arxiv link: https://arxiv.org/abs/2311.05610
  • Pdf link: https://arxiv.org/pdf/2311.05610
  • Abstract Efficiently training large language models requires parallelizing across hundreds of hardware accelerators and invoking various compute and memory optimizations. When combined, many of these strategies have complex interactions regarding the final training efficiency. Prior work tackling this problem did not have access to the latest set of optimizations, such as FlashAttention or sequence parallelism. In this work, we conduct a comprehensive ablation study of possible training configurations for large language models. We distill this large study into several key recommendations for the most efficient training. For instance, we find that using a micro-batch size of 1 usually enables the most efficient training layouts. Larger micro-batch sizes necessitate activation checkpointing or higher degrees of model parallelism and also lead to larger pipeline bubbles. Our most efficient configurations enable us to achieve state-of-the-art training efficiency results over a range of model sizes, most notably a Model FLOPs utilization of 70.5% when training a 13B model.

Keyword: adam

On neural and dimensional collapse in supervised and unsupervised contrastive learning with hard negative sampling

  • Authors: Authors: Ruijie Jiang, Thuan Nguyen, Shuchin Aeron, Prakash Ishwar
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05139
  • Pdf link: https://arxiv.org/pdf/2311.05139
  • Abstract For a widely-studied data model and general loss and sample-hardening functions we prove that the Supervised Contrastive Learning (SCL), Hard-SCL (HSCL), and Unsupervised Contrastive Learning (UCL) risks are minimized by representations that exhibit Neural Collapse (NC), i.e., the class means form an Equianglular Tight Frame (ETF) and data from the same class are mapped to the same representation. We also prove that for any representation mapping, the HSCL and Hard-UCL (HUCL) risks are lower bounded by the corresponding SCL and UCL risks. Although the optimality of ETF is known for SCL, albeit only for InfoNCE loss, its optimality for HSCL and UCL under general loss and hardening functions is novel. Moreover, our proofs are much simpler, compact, and transparent. We empirically demonstrate, for the first time, that ADAM optimization of HSCL and HUCL risks with random initialization and suitable hardness levels can indeed converge to the NC geometry if we incorporate unit-ball or unit-sphere feature normalization. Without incorporating hard negatives or feature normalization, however, the representations learned via ADAM suffer from dimensional collapse (DC) and fail to attain the NC geometry.

Keyword: gradient

Lightweight Diffusion Models with Distillation-Based Block Neural Architecture Search

  • Authors: Authors: Siao Tang, Xin Wang, Hong Chen, Chaoyu Guan, Yansong Tang, Wenwu zhu
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.04950
  • Pdf link: https://arxiv.org/pdf/2311.04950
  • Abstract Diffusion models have recently shown remarkable generation ability, achieving state-of-the-art performance in many tasks. However, the high computational cost is still a troubling problem for diffusion models. To tackle this problem, we propose to automatically remove the structural redundancy in diffusion models with our proposed Diffusion Distillation-based Block-wise Neural Architecture Search (DiffNAS). Specifically, given a larger pretrained teacher, we leverage DiffNAS to search for the smallest architecture which achieves on-par or even better performance than the teacher. Considering current diffusion models are based on UNet which naturally has a block-wise structure, we perform neural architecture search independently in each block, which largely reduces the search space. Different from previous block-wise NAS methods, DiffNAS contains a block-wise local search strategy and a retraining strategy with a joint dynamic loss. Concretely, during the search process, we block-wisely select the best subnet to avoid the unfairness brought by the global search strategy used in previous works. When retraining the searched architecture, we adopt a dynamic joint loss to maintain the consistency between supernet training and subnet retraining, which also provides informative objectives for each block and shortens the paths of gradient propagation. We demonstrate this joint loss can effectively improve model performance. We also prove the necessity of the dynamic adjustment of this loss. The experiments show that our method can achieve significant computational reduction, especially on latent diffusion models with about 50% MACs and Parameter reduction.

Familiarity-Based Open-Set Recognition Under Adversarial Attacks

  • Authors: Authors: Philip Enevoldsen, Christian Gundersen, Nico Lang, Serge Belongie, Christian Igel
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05006
  • Pdf link: https://arxiv.org/pdf/2311.05006
  • Abstract Open-set recognition (OSR), the identification of novel categories, can be a critical component when deploying classification models in real-world applications. Recent work has shown that familiarity-based scoring rules such as the Maximum Softmax Probability (MSP) or the Maximum Logit Score (MLS) are strong baselines when the closed-set accuracy is high. However, one of the potential weaknesses of familiarity-based OSR are adversarial attacks. Here, we present gradient-based adversarial attacks on familiarity scores for both types of attacks, False Familiarity and False Novelty attacks, and evaluate their effectiveness in informed and uninformed settings on TinyImageNet.

Reinforcement Learning Generalization for Nonlinear Systems Through Dual-Scale Homogeneity Transformations

  • Authors: Authors: Abdel Gafoor Haddad, Igor Boiko, Yahya Zweiri
  • Subjects: Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2311.05013
  • Pdf link: https://arxiv.org/pdf/2311.05013
  • Abstract Reinforcement learning is an emerging approach to control dynamical systems for which classical approaches are difficult to apply. However, trained agents may not generalize against the variations of system parameters. This paper presents the concept of dual-scale homogeneity, an important property in understating the scaling behavior of nonlinear systems. Furthermore, it also presents an effective yet simple approach to designing a parameter-dependent control law that homogenizes a nonlinear system. The presented approach is applied to two systems, demonstrating its ability to provide a consistent performance irrespective of parameters variations. To demonstrate the practicality of the proposed approach, the control policy is generated by a deep deterministic policy gradient to control the load position of a quadrotor with a slung load. The proposed synergy between the homogeneity transformations and reinforcement learning yields superior performance compared to other recent learning-based control techniques. It achieves a success rate of 96% in bringing the load to its designated target with a 3D RMSE of 0.0253 m. The video that shows the experimental results along with a summary of the paper is available at this link.

Geometry-Calibrated DRO: Combating Over-Pessimism with Free Energy Implications

  • Authors: Authors: Jiashuo Liu, Jiayun Wu, Tianyu Wang, Hao Zou, Bo Li, Peng Cui
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2311.05054
  • Pdf link: https://arxiv.org/pdf/2311.05054
  • Abstract Machine learning algorithms minimizing average risk are susceptible to distributional shifts. Distributionally Robust Optimization (DRO) addresses this issue by optimizing the worst-case risk within an uncertainty set. However, DRO suffers from over-pessimism, leading to low-confidence predictions, poor parameter estimations as well as poor generalization. In this work, we conduct a theoretical analysis of a probable root cause of over-pessimism: excessive focus on noisy samples. To alleviate the impact of noise, we incorporate data geometry into calibration terms in DRO, resulting in our novel Geometry-Calibrated DRO (GCDRO) for regression. We establish the connection between our risk objective and the Helmholtz free energy in statistical physics, and this free-energy-based risk can extend to standard DRO methods. Leveraging gradient flow in Wasserstein space, we develop an approximate minimax optimization algorithm with a bounded error ratio and elucidate how our approach mitigates noisy sample effects. Comprehensive experiments confirm GCDRO's superiority over conventional DRO methods.

Efficient Compression of Overparameterized Deep Models through Low-Dimensional Learning Dynamics

  • Authors: Authors: Soo Min Kwon, Zekai Zhang, Dogyoon Song, Laura Balzano, Qing Qu
  • Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2311.05061
  • Pdf link: https://arxiv.org/pdf/2311.05061
  • Abstract Overparameterized models have proven to be powerful tools for solving various machine learning tasks. However, overparameterization often leads to a substantial increase in computational and memory costs, which in turn requires extensive resources to train. In this work, we aim to reduce this complexity by studying the learning dynamics of overparameterized deep networks. By extensively studying its learning dynamics, we unveil that the weight matrices of various architectures exhibit a low-dimensional structure. This finding implies that we can compress the networks by reducing the training to a small subspace. We take a step in developing a principled approach for compressing deep networks by studying deep linear models. We demonstrate that the principal components of deep linear models are fitted incrementally but within a small subspace, and use these insights to compress deep linear networks by decreasing the width of its intermediate layers. Remarkably, we observe that with a particular choice of initialization, the compressed network converges faster than the original network, consistently yielding smaller recovery errors throughout all iterations of gradient descent. We substantiate this observation by developing a theory focused on the deep matrix factorization problem, and by conducting empirical evaluations on deep matrix sensing. Finally, we demonstrate how our compressed model can enhance the utility of deep nonlinear models. Overall, we observe that our compression technique accelerates the training process by more than 2x, without compromising model quality.

Exploring and Analyzing Wildland Fire Data Via Machine Learning Techniques

  • Authors: Authors: Dipak Dulal, Joseph J. Charney, Michael Gallagher, Carmeliza Navasca, Nicholas Skowronski
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05128
  • Pdf link: https://arxiv.org/pdf/2311.05128
  • Abstract This research project investigated the correlation between a 10 Hz time series of thermocouple temperatures and turbulent kinetic energy (TKE) computed from wind speeds collected from a small experimental prescribed burn at the Silas Little Experimental Forest in New Jersey, USA. The primary objective of this project was to explore the potential for using thermocouple temperatures as predictors for estimating the TKE produced by a wildland fire. Machine learning models, including Deep Neural Networks, Random Forest Regressor, Gradient Boosting, and Gaussian Process Regressor, are employed to assess the potential for thermocouple temperature perturbations to predict TKE values. Data visualization and correlation analyses reveal patterns and relationships between thermocouple temperatures and TKE, providing insight into the underlying dynamics. The project achieves high accuracy in predicting TKE by employing various machine learning models despite a weak correlation between the predictors and the target variable. The results demonstrate significant success, particularly from regression models, in accurately estimating the TKE. The research findings contribute to fire behavior and smoke modeling science, emphasizing the importance of incorporating machine learning approaches and identifying complex relationships between fine-scale fire behavior and turbulence. Accurate TKE estimation using thermocouple temperatures allows for the refinement of models that can inform decision-making in fire management strategies, facilitate effective risk mitigation, and optimize fire management efforts. This project highlights the valuable role of machine learning techniques in analyzing wildland fire data, showcasing their potential to advance fire research and management practices.

When Meta-Learning Meets Online and Continual Learning: A Survey

  • Authors: Authors: Jaehyeon Son, Soochan Lee, Gunhee Kim
  • Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2311.05241
  • Pdf link: https://arxiv.org/pdf/2311.05241
  • Abstract Over the past decade, deep neural networks have demonstrated significant success using the training scheme that involves mini-batch stochastic gradient descent on extensive datasets. Expanding upon this accomplishment, there has been a surge in research exploring the application of neural networks in other learning scenarios. One notable framework that has garnered significant attention is meta-learning. Often described as "learning to learn," meta-learning is a data-driven approach to optimize the learning algorithm. Other branches of interest are continual learning and online learning, both of which involve incrementally updating a model with streaming data. While these frameworks were initially developed independently, recent works have started investigating their combinations, proposing novel problem settings and learning algorithms. However, due to the elevated complexity and lack of unified terminology, discerning differences between the learning frameworks can be challenging even for experienced researchers. To facilitate a clear understanding, this paper provides a comprehensive survey that organizes various problem settings using consistent terminology and formal descriptions. By offering an overview of these learning paradigms, our work aims to foster further advancements in this promising area of research.

ABIGX: A Unified Framework for eXplainable Fault Detection and Classification

  • Authors: Authors: Yue Zhuo, Jinchuan Qian, Zhihuan Song, Zhiqiang Ge
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2311.05316
  • Pdf link: https://arxiv.org/pdf/2311.05316
  • Abstract For explainable fault detection and classification (FDC), this paper proposes a unified framework, ABIGX (Adversarial fault reconstruction-Based Integrated Gradient eXplanation). ABIGX is derived from the essentials of previous successful fault diagnosis methods, contribution plots (CP) and reconstruction-based contribution (RBC). It is the first explanation framework that provides variable contributions for the general FDC models. The core part of ABIGX is the adversarial fault reconstruction (AFR) method, which rethinks the FR from the perspective of adversarial attack and generalizes to fault classification models with a new fault index. For fault classification, we put forward a new problem of fault class smearing, which intrinsically hinders the correct explanation. We prove that ABIGX effectively mitigates this problem and outperforms the existing gradient-based explanation methods. For fault detection, we theoretically bridge ABIGX with conventional fault diagnosis methods by proving that CP and RBC are the linear specifications of ABIGX. The experiments evaluate the explanations of FDC by quantitative metrics and intuitive illustrations, the results of which show the general superiority of ABIGX to other advanced explanation methods.

Accelerated Shapley Value Approximation for Data Evaluation

  • Authors: Authors: Lauren Watson, Zeno Kujawa, Rayna Andreeva, Hao-Tsung Yang, Tariq Elahi, Rik Sarkar
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.05346
  • Pdf link: https://arxiv.org/pdf/2311.05346
  • Abstract Data valuation has found various applications in machine learning, such as data filtering, efficient learning and incentives for data sharing. The most popular current approach to data valuation is the Shapley value. While popular for its various applications, Shapley value is computationally expensive even to approximate, as it requires repeated iterations of training models on different subsets of data. In this paper we show that the Shapley value of data points can be approximated more efficiently by leveraging the structural properties of machine learning problems. We derive convergence guarantees on the accuracy of the approximate Shapley value for different learning settings including Stochastic Gradient Descent with convex and non-convex loss functions. Our analysis suggests that in fact models trained on small subsets are more important in the context of data valuation. Based on this idea, we describe $\delta$-Shapley -- a strategy of only using small subsets for the approximation. Experiments show that this approach preserves approximate value and rank of data, while achieving speedup of up to 9.9x. In pre-trained networks the approach is found to bring more efficiency in terms of accurate evaluation using small subsets.

Airfoil generation and feature extraction using the conditional VAE-WGAN-gp

  • Authors: Authors: Kazuo Yonekura, Yuki Tomori, Katsuyuki Suzuki
  • Subjects: Computational Engineering, Finance, and Science (cs.CE)
  • Arxiv link: https://arxiv.org/abs/2311.05445
  • Pdf link: https://arxiv.org/pdf/2311.05445
  • Abstract A machine learning method was applied to solve an inverse airfoil design problem. A conditional VAE-WGAN-gp model, which couples the conditional variational autoencoder (VAE) and Wasserstein generative adversarial network with gradient penalty (WGAN-gp), is proposed for an airfoil generation method, and then it is compared with the WGAN-gp and VAE models. The VAEGAN model couples the VAE and GAN models, which enables feature extraction in the GAN models. In airfoil generation tasks, to generate airfoil shapes that satisfy lift coefficient requirements, it is known that VAE outperforms WGAN-gp with respect to the accuracy of the reproduction of the lift coefficient, whereas GAN outperforms VAE with respect to the smoothness and variations of generated shapes. In this study, VAE-WGAN-gp demonstrated a good performance in all three aspects. Latent distribution was also studied to compare the feature extraction ability of the proposed method.

A Coefficient Makes SVRG Effective

  • Authors: Authors: Yida Yin, Zhiqiu Xu, Zhiyuan Li, Trevor Darrell, Zhuang Liu
  • Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2311.05589
  • Pdf link: https://arxiv.org/pdf/2311.05589
  • Abstract Stochastic Variance Reduced Gradient (SVRG), introduced by Johnson & Zhang (2013), is a theoretically compelling optimization method. However, as Defazio & Bottou (2019) highlights, its effectiveness in deep learning is yet to be proven. In this work, we demonstrate the potential of SVRG in optimizing real-world neural networks. Our analysis finds that, for deeper networks, the strength of the variance reduction term in SVRG should be smaller and decrease as training progresses. Inspired by this, we introduce a multiplicative coefficient $\alpha$ to control the strength and adjust it through a linear decay schedule. We name our method $\alpha$-SVRG. Our results show $\alpha$-SVRG better optimizes neural networks, consistently reducing training loss compared to both baseline and the standard SVRG across various architectures and image classification datasets. We hope our findings encourage further exploration into variance reduction techniques in deep learning. Code is available at https://github.com/davidyyd/alpha-SVRG.

Keyword: super-resolution

There is no result

zoq avatar Nov 10 '23 07:11 zoq