arxiv-updates icon indicating copy to clipboard operation
arxiv-updates copied to clipboard

New submissions for Mon, 9 Oct 23

Open zoq opened this issue 1 year ago • 0 comments

Keyword: sgd

Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL

  • Authors: Authors: Yang Yue, Rui Lu, Bingyi Kang, Shiji Song, Gao Huang
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2310.04411
  • Pdf link: https://arxiv.org/pdf/2310.04411
  • Abstract The divergence of the Q-value estimation has been a prominent issue in offline RL, where the agent has no access to real dynamics. Traditional beliefs attribute this instability to querying out-of-distribution actions when bootstrapping value targets. Though this issue can be alleviated with policy constraints or conservative Q estimation, a theoretical understanding of the underlying mechanism causing the divergence has been absent. In this work, we aim to thoroughly comprehend this mechanism and attain an improved solution. We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL. Then, we propose a novel Self-Excite Eigenvalue Measure (SEEM) metric based on Neural Tangent Kernel (NTK) to measure the evolving property of Q-network at training, which provides an intriguing explanation of the emergence of divergence. For the first time, our theory can reliably decide whether the training will diverge at an early stage, and even predict the order of the growth for the estimated Q-value, the model's norm, and the crashing step when an SGD optimizer is used. The experiments demonstrate perfect alignment with this theoretic analysis. Building on our insights, we propose to resolve divergence from a novel perspective, namely improving the model's architecture for better extrapolating behavior. Through extensive empirical studies, we identify LayerNorm as a good solution to effectively avoid divergence without introducing detrimental bias, leading to superior performance. Experimental results prove that it can still work in some most challenging settings, i.e. using only 1 transitions of the dataset, where all previous methods fail. Moreover, it can be easily plugged into modern offline RL methods and achieve SOTA results on many challenging tasks. We also give unique insights into its effectiveness.

Why Do We Need Weight Decay in Modern Deep Learning?

  • Authors: Authors: Maksym Andriushchenko, Francesco D'Angelo, Aditya Varre, Nicolas Flammarion
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2310.04415
  • Pdf link: https://arxiv.org/pdf/2310.04415
  • Abstract Weight decay is a broadly used technique for training state-of-the-art deep networks, including large language models. Despite its widespread usage, its role remains poorly understood. In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory. For overparameterized deep networks, we show how weight decay modifies the optimization dynamics enhancing the ever-present implicit regularization of SGD via the loss stabilization mechanism. In contrast, for underparameterized large language models trained with nearly online SGD, we describe how weight decay balances the bias-variance tradeoff in stochastic optimization leading to lower training loss. Moreover, we show that weight decay also prevents sudden loss divergences for bfloat16 mixed-precision training which is a crucial tool for LLM training. Overall, we present a unifying perspective from ResNets on vision tasks to LLMs: weight decay is never useful as an explicit regularizer but instead changes the training dynamics in a desirable way. Our code is available at https://github.com/tml-epfl/why-weight-decay.

Keyword: optimization

Surrogate-based Real-time Curbside Management for Ride-hailing and Delivery Operations

  • Authors: Authors: Suyash C. Vishnoi, Michele D. Simoni
  • Subjects: Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2310.03883
  • Pdf link: https://arxiv.org/pdf/2310.03883
  • Abstract The present work investigates surrogate model-based optimization for real-time curbside traffic management operations. An optimization problem is formulated to minimize the congestion on roadway segments caused by vehicles stopping on the segment (e.g., ride-hailing or delivery operations) and implemented in a model predictive control framework. A hybrid simulation approach where main traffic flows interact with individually modeled stopping vehicles is adopted. Due to its non-linearity, the optimization problem is coupled with a meta-heuristic. However, because simulations are time expensive and hence unsuitable for real-time control, a trained surrogate model that takes the decision variables as inputs and approximates the objective function is employed to replace the simulation within the meta-heuristic algorithm. Several modeling techniques (i.e., linear regression, polynomial regression, neural network, radial basis network, regression tree ensemble, and Gaussian process regression) are compared based on their accuracy in reproducing solutions to the problem and computational tractability for real-time control under different configurations of simulation parameters. It is found that Gaussian process regression is the most suited for use as a surrogate model for the given problem. Finally, a realistic application with multiple ride-hailing vehicle operations is presented. The proposed approach for controlling the stop positions of vehicles is able to achieve an improvement of 20.65% over the uncontrolled case. The example shows the potential of the proposed approach in reducing the negative impacts of stopping vehicles and favorable computational properties.

Accelerated Neural Network Training with Rooted Logistic Objectives

  • Authors: Authors: Zhu Wang, Praveen Raj Veluswami, Harsh Mishra, Sathya N. Ravi
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2310.03890
  • Pdf link: https://arxiv.org/pdf/2310.03890
  • Abstract Many neural networks deployed in the real world scenarios are trained using cross entropy based loss functions. From the optimization perspective, it is known that the behavior of first order methods such as gradient descent crucially depend on the separability of datasets. In fact, even in the most simplest case of binary classification, the rate of convergence depends on two factors: (1) condition number of data matrix, and (2) separability of the dataset. With no further pre-processing techniques such as over-parametrization, data augmentation etc., separability is an intrinsic quantity of the data distribution under consideration. We focus on the landscape design of the logistic function and derive a novel sequence of {\em strictly} convex functions that are at least as strict as logistic loss. The minimizers of these functions coincide with those of the minimum norm solution wherever possible. The strict convexity of the derived function can be extended to finetune state-of-the-art models and applications. In empirical experimental analysis, we apply our proposed rooted logistic objective to multiple deep models, e.g., fully-connected neural networks and transformers, on various of classification benchmarks. Our results illustrate that training with rooted loss function is converged faster and gains performance improvements. Furthermore, we illustrate applications of our novel rooted loss function in generative modeling based downstream applications, such as finetuning StyleGAN model with the rooted loss. The code implementing our losses and models can be found here for open source software development purposes: https://anonymous.4open.science/r/rooted_loss.

RTDK-BO: High Dimensional Bayesian Optimization with Reinforced Transformer Deep kernels

  • Authors: Authors: Alexander Shmakov, Avisek Naug, Vineet Gundecha, Sahand Ghorbanpour, Ricardo Luna Gutierrez, Ashwin Ramesh Babu, Antonio Guillen, Soumyendu Sarkar
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2310.03912
  • Pdf link: https://arxiv.org/pdf/2310.03912
  • Abstract Bayesian Optimization (BO), guided by Gaussian process (GP) surrogates, has proven to be an invaluable technique for efficient, high-dimensional, black-box optimization, a critical problem inherent to many applications such as industrial design and scientific computing. Recent contributions have introduced reinforcement learning (RL) to improve the optimization performance on both single function optimization and \textit{few-shot} multi-objective optimization. However, even few-shot techniques fail to exploit similarities shared between closely related objectives. In this paper, we combine recent developments in Deep Kernel Learning (DKL) and attention-based Transformer models to improve the modeling powers of GP surrogates with meta-learning. We propose a novel method for improving meta-learning BO surrogates by incorporating attention mechanisms into DKL, empowering the surrogates to adapt to contextual information gathered during the BO process. We combine this Transformer Deep Kernel with a learned acquisition function trained with continuous Soft Actor-Critic Reinforcement Learning to aid in exploration. This Reinforced Transformer Deep Kernel (RTDK-BO) approach yields state-of-the-art results in continuous high-dimensional optimization problems.

LaTeX: Language Pattern-aware Triggering Event Detection for Adverse Experience during Pandemics

  • Authors: Authors: Kaiqun Fu, Yangxiao Bai, Weiwei Zhang, Deepthi Kolady
  • Subjects: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
  • Arxiv link: https://arxiv.org/abs/2310.03941
  • Pdf link: https://arxiv.org/pdf/2310.03941
  • Abstract The COVID-19 pandemic has accentuated socioeconomic disparities across various racial and ethnic groups in the United States. While previous studies have utilized traditional survey methods like the Household Pulse Survey (HPS) to elucidate these disparities, this paper explores the role of social media platforms in both highlighting and addressing these challenges. Drawing from real-time data sourced from Twitter, we analyzed language patterns related to four major types of adverse experiences: loss of employment income (LI), food scarcity (FS), housing insecurity (HI), and unmet needs for mental health services (UM). We first formulate a sparsity optimization problem that extracts low-level language features from social media data sources. Second, we propose novel constraints on feature similarity exploiting prior knowledge about the similarity of the language patterns among the adverse experiences. The proposed problem is challenging to solve due to the non-convexity objective and non-smoothness penalties. We develop an algorithm based on the alternating direction method of multipliers (ADMM) framework to solve the proposed formulation. Extensive experiments and comparisons to other models on real-world social media and the detection of adverse experiences justify the efficacy of our model.

Gradient Descent Provably Solves Nonlinear Tomographic Reconstruction

  • Authors: Authors: Sara Fridovich-Keil, Fabrizio Valdivia, Gordon Wetzstein, Benjamin Recht, Mahdi Soltanolkotabi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC); Medical Physics (physics.med-ph)
  • Arxiv link: https://arxiv.org/abs/2310.03956
  • Pdf link: https://arxiv.org/pdf/2310.03956
  • Abstract In computed tomography (CT), the forward model consists of a linear Radon transform followed by an exponential nonlinearity based on the attenuation of light according to the Beer-Lambert Law. Conventional reconstruction often involves inverting this nonlinearity as a preprocessing step and then solving a convex inverse problem. However, this nonlinear measurement preprocessing required to use the Radon transform is poorly conditioned in the vicinity of high-density materials, such as metal. This preprocessing makes CT reconstruction methods numerically sensitive and susceptible to artifacts near high-density regions. In this paper, we study a technique where the signal is directly reconstructed from raw measurements through the nonlinear forward model. Though this optimization is nonconvex, we show that gradient descent provably converges to the global optimum at a geometric rate, perfectly reconstructing the underlying signal with a near minimal number of random measurements. We also prove similar results in the under-determined setting where the number of measurements is significantly smaller than the dimension of the signal. This is achieved by enforcing prior structural information about the signal through constraints on the optimization variables. We illustrate the benefits of direct nonlinear CT reconstruction with cone-beam CT experiments on synthetic and real 3D volumes. We show that this approach reduces metal artifacts compared to a commercial reconstruction of a human skull with metal dental crowns.

Adaptive Computation of an Elliptic Eigenvalue Optimization Problem with a Phase-Field Approach

  • Authors: Authors: Jing Li, Yifeng Xu, Shengfeng Zhu
  • Subjects: Numerical Analysis (math.NA)
  • Arxiv link: https://arxiv.org/abs/2310.03970
  • Pdf link: https://arxiv.org/pdf/2310.03970
  • Abstract In this paper, we discuss adaptive approximations of an elliptic eigenvalue optimization problem in a phase-field setting by a conforming finite element method. An adaptive algorithm is proposed and implemented in several two dimensional numerical examples for illustration of efficiency and accuracy. Theoretical findings consist in the vanishing limit of a subsequence of estimators and the convergence of the relevant subsequence of adaptively-generated solutions to a solution to the continuous optimality system.

The Floyd-Warshall Algorithm Re-implemented Using 3D-Tensors and Hardware Acceleration

  • Authors: Authors: Taher Anjary
  • Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
  • Arxiv link: https://arxiv.org/abs/2310.03983
  • Pdf link: https://arxiv.org/pdf/2310.03983
  • Abstract The Floyd-Warshall(FW) algorithm, is an ancient but a largely important algorithm used to solve the all-pairs simple-paths(APSP) problem. While the algorithm is available for use in open-source graph optimization libraries such as NetworkX, they do not take advantage of modern parallel processing hardware such as Graphics Processing Units(GPUs), which would reduce compute time to a fraction of its iterative or recursive implementations. In this work, a re-implementation of the Floyd-Warshall algorithm using open-source GPU libraries such as PyTorch is presented. A further implementation of the R-Kleene is also described, a slightly newer algorithm used for solving the APSP problem in a divide-and-conquer, recursive but highly parallelized architecture. In addition, a random graph generator that generates a wide range of graphs of different scales is also contributed, where the densities and connectivities are controlled using some heuristics. The run-times of the GPU accelerated FW algorithm and R-Kleene on these heuristically generated graphs are evaluated against each other and to the widely used implementation from NetworkX. The code for the GPU implementation of the algorithms, the random graph generator, and the Blender animation file are available at https://github.com/tanjary21/APSP_GPU/.

AdaRec: Adaptive Sequential Recommendation for Reinforcing Long-term User Engagement

  • Authors: Authors: Zhenghai Xue, Qingpeng Cai, Tianyou Zuo, Bin Yang, Lantao Hu, Peng Jiang, Kun Gai, Bo An
  • Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2310.03984
  • Pdf link: https://arxiv.org/pdf/2310.03984
  • Abstract Growing attention has been paid to Reinforcement Learning (RL) algorithms when optimizing long-term user engagement in sequential recommendation tasks. One challenge in large-scale online recommendation systems is the constant and complicated changes in users' behavior patterns, such as interaction rates and retention tendencies. When formulated as a Markov Decision Process (MDP), the dynamics and reward functions of the recommendation system are continuously affected by these changes. Existing RL algorithms for recommendation systems will suffer from distribution shift and struggle to adapt in such an MDP. In this paper, we introduce a novel paradigm called Adaptive Sequential Recommendation (AdaRec) to address this issue. AdaRec proposes a new distance-based representation loss to extract latent information from users' interaction trajectories. Such information reflects how RL policy fits to current user behavior patterns, and helps the policy to identify subtle changes in the recommendation system. To make rapid adaptation to these changes, AdaRec encourages exploration with the idea of optimism under uncertainty. The exploration is further guarded by zero-order action optimization to ensure stable recommendation quality in complicated environments. We conduct extensive empirical analyses in both simulator-based and live sequential recommendation tasks, where AdaRec exhibits superior long-term performance compared to all baseline algorithms.

Snail Homing and Mating Search Algorithm: A Novel Bio-Inspired Metaheuristic Algorithm

  • Authors: Authors: Anand J Kulkarni, Ishaan R Kale, Apoorva Shastri, Aayush Khandekar
  • Subjects: Neural and Evolutionary Computing (cs.NE)
  • Arxiv link: https://arxiv.org/abs/2310.04020
  • Pdf link: https://arxiv.org/pdf/2310.04020
  • Abstract In this paper, a novel Snail Homing and Mating Search (SHMS) algorithm is proposed. It is inspired from the biological behaviour of the snails. Snails continuously travels to find food and a mate, leaving behind a trail of mucus that serves as a guide for their return. Snails tend to navigate by following the available trails on the ground and responding to cues from nearby shelter homes. The proposed SHMS algorithm is investigated by solving several unimodal and multimodal functions. The solutions are validated using standard statistical tests such as two-sided and pairwise signed rank Wilcoxon test and Friedman rank test. The solution obtained from the SHMS algorithm exhibited superior robustness as well as search space exploration capabilities within the less computational cost. The real-world application of SHMS algorithm is successfully demonstrated in the engineering design domain by solving three cases of design and economic optimization shell and tube heat exchanger problem. The objective function value and other statistical results obtained using SHMS algorithm are compared with other well-known metaheuristic algorithms.

Nonlinear Methods for Shape Optimization Problems in Liquid Crystal Tactoids

  • Authors: Authors: James H. Adler, Anca S. Andrei, Timothy J. Atherton
  • Subjects: Numerical Analysis (math.NA)
  • Arxiv link: https://arxiv.org/abs/2310.04022
  • Pdf link: https://arxiv.org/pdf/2310.04022
  • Abstract Anisotropic fluids, such as nematic liquid crystals, can form non-spherical equilibrium shapes known as tactoids. Predicting the shape of these structures as a function of material parameters is challenging and paradigmatic of a broader class of problems that combine shape and order. Here, we develop a discrete shape optimization approach with finite elements to find the configuration of a two-dimensional tactoid using the Landau de Gennes framework and a Q-tensor representation. Efficient solution of the resulting constrained energy minimization problem is achieved using a quasi-Newton and nested iteration algorithm. Numerical validation is performed with benchmark solutions and compared against experimental data and earlier work. We explore physically motivated subproblems, whereby the shape and order are separately held fixed, to explore the role of both and examine material parameter dependence of the convergence. Nested iteration significantly improves both the computational cost and convergence of numerical solutions of these highly deformable materials.

Joint Projection Learning and Tensor Decomposition Based Incomplete Multi-view Clustering

  • Authors: Authors: Wei Lv, Chao Zhang, Huaxiong Li, Xiuyi Jia, Chunlin Chen
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2310.04038
  • Pdf link: https://arxiv.org/pdf/2310.04038
  • Abstract Incomplete multi-view clustering (IMVC) has received increasing attention since it is often that some views of samples are incomplete in reality. Most existing methods learn similarity subgraphs from original incomplete multi-view data and seek complete graphs by exploring the incomplete subgraphs of each view for spectral clustering. However, the graphs constructed on the original high-dimensional data may be suboptimal due to feature redundancy and noise. Besides, previous methods generally ignored the graph noise caused by the inter-class and intra-class structure variation during the transformation of incomplete graphs and complete graphs. To address these problems, we propose a novel Joint Projection Learning and Tensor Decomposition Based method (JPLTD) for IMVC. Specifically, to alleviate the influence of redundant features and noise in high-dimensional data, JPLTD introduces an orthogonal projection matrix to project the high-dimensional features into a lower-dimensional space for compact feature learning.Meanwhile, based on the lower-dimensional space, the similarity graphs corresponding to instances of different views are learned, and JPLTD stacks these graphs into a third-order low-rank tensor to explore the high-order correlations across different views. We further consider the graph noise of projected data caused by missing samples and use a tensor-decomposition based graph filter for robust clustering.JPLTD decomposes the original tensor into an intrinsic tensor and a sparse tensor. The intrinsic tensor models the true data similarities. An effective optimization algorithm is adopted to solve the JPLTD model. Comprehensive experiments on several benchmark datasets demonstrate that JPLTD outperforms the state-of-the-art methods. The code of JPLTD is available at https://github.com/weilvNJU/JPLTD.

Bivariate Estimation-of-Distribution Algorithms Can Find an Exponential Number of Optima

  • Authors: Authors: Benjamin Doerr, Martin S. Krejca
  • Subjects: Neural and Evolutionary Computing (cs.NE)
  • Arxiv link: https://arxiv.org/abs/2310.04042
  • Pdf link: https://arxiv.org/pdf/2310.04042
  • Abstract Finding a large set of optima in a multimodal optimization landscape is a challenging task. Classical population-based evolutionary algorithms typically converge only to a single solution. While this can be counteracted by applying niching strategies, the number of optima is nonetheless trivially bounded by the population size. Estimation-of-distribution algorithms (EDAs) are an alternative, maintaining a probabilistic model of the solution space instead of a population. Such a model is able to implicitly represent a solution set far larger than any realistic population size. To support the study of how optimization algorithms handle large sets of optima, we propose the test function EqualBlocksOneMax (EBOM). It has an easy fitness landscape with exponentially many optima. We show that the bivariate EDA mutual-information-maximizing input clustering, without any problem-specific modification, quickly generates a model that behaves very similarly to a theoretically ideal model for EBOM, which samples each of the exponentially many optima with the same maximal probability. We also prove via mathematical means that no univariate model can come close to having this property: If the probability to sample an optimum is at least inverse-polynomial, there is a Hamming ball of logarithmic radius such that, with high probability, each sample is in this ball.

Graph-based 3D Collision-distance Estimation Network with Probabilistic Graph Rewiring

  • Authors: Authors: Minjae Song, Yeseung Kim, Daehyung Park
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2310.04044
  • Pdf link: https://arxiv.org/pdf/2310.04044
  • Abstract We aim to solve the problem of data-driven collision-distance estimation given 3-dimensional (3D) geometries. Conventional algorithms suffer from low accuracy due to their reliance on limited representations, such as point clouds. In contrast, our previous graph-based model, GraphDistNet, achieves high accuracy using edge information but incurs higher message-passing costs with growing graph size, limiting its applicability to 3D geometries. To overcome these challenges, we propose GDN-R, a novel 3D graph-based estimation network.GDN-R employs a layer-wise probabilistic graph-rewiring algorithm leveraging the differentiable Gumbel-top-K relaxation. Our method accurately infers minimum distances through iterative graph rewiring and updating relevant embeddings. The probabilistic rewiring enables fast and robust embedding with respect to unforeseen categories of geometries. Through 41,412 random benchmark tasks with 150 pairs of 3D objects, we show GDN-R outperforms state-of-the-art baseline methods in terms of accuracy and generalizability. We also show that the proposed rewiring improves the update performance reducing the size of the estimation model. We finally show its batch prediction and auto-differentiation capabilities for trajectory optimization in both simulated and real-world scenarios.

Globally Optimal Resource Allocation Design for Discrete Phase Shift IRS-Assisted Multiuser Networks with Perfect and Imperfect CSI

  • Authors: Authors: Yifei Wu, Dongfang Xu, Derrick Wing Kwan Ng, Robert Schober, Wolfgang Gerstacker
  • Subjects: Information Theory (cs.IT)
  • Arxiv link: https://arxiv.org/abs/2310.04063
  • Pdf link: https://arxiv.org/pdf/2310.04063
  • Abstract Intelligent reflecting surfaces (IRSs) are a promising low-cost solution for achieving high spectral and energy efficiency in future communication systems by enabling the customization of wireless propagation environments. Despite the plethora of research on resource allocation design for IRS-assisted multiuser communication systems, the optimal design and the corresponding performance upper bound are still not fully understood. To bridge this gap in knowledge, in this paper, we investigate the optimal resource allocation design for IRS-assisted multiuser systems employing practical discrete IRS phase shifters. In particular, we jointly optimize the beamforming vector at the base station (BS) and the discrete IRS phase shifts to minimize the total transmit power for the cases of perfect and imperfect channel state information (CSI) knowledge. To this end, two novel algorithms based on the generalized Benders decomposition (GBD) method are developed to obtain the globally optimal solution for perfect and imperfect CSI, respectively. Moreover, to facilitate practical implementation, we propose two corresponding low-complexity suboptimal algorithms with guaranteed convergence by capitalizing on successive convex approximation (SCA). In particular, for imperfect CSI, we adopt a bounded error model to characterize the CSI uncertainty and propose a new transformation to convexify the robust quality-of-service (QoS) constraints. Our numerical results confirm the optimality of the proposed GBD-based algorithms for the considered system for both perfect and imperfect CSI. Furthermore, we unveil that both proposed SCA-based algorithms can achieve a close-to-optimal performance within a few iterations. Moreover, compared with the state-of-the-art solution based on the alternating optimization (AO) method, the proposed SCA-based scheme achieves a significant performance gain with low complexity.

Enumeration and updates for conjunctive linear algebra queries through expressibility

  • Authors: Authors: Thomas Muñoz, Cristian Riveros, Stijn Vansummeren
  • Subjects: Computational Complexity (cs.CC); Databases (cs.DB); Data Structures and Algorithms (cs.DS)
  • Arxiv link: https://arxiv.org/abs/2310.04118
  • Pdf link: https://arxiv.org/pdf/2310.04118
  • Abstract Due to the importance of linear algebra and matrix operations in data analytics, there is significant interest in using relational query optimization and processing techniques for evaluating (sparse) linear algebra programs. In particular, in recent years close connections have been established between linear algebra programs and relational algebra that allow transferring optimization techniques of the latter to the former. In this paper, we ask ourselves which linear algebra programs in MATLANG correspond to the free-connex and q-hierarchical fragments of conjunctive first-order logic. Both fragments have desirable query processing properties: free-connex conjunctive queries support constant-delay enumeration after a linear-time preprocessing phase, and q-hierarchical conjunctive queries further allow constant-time updates. By characterizing the corresponding fragments of MATLANG, we hence identify the fragments of linear algebra programs that one can evaluate with constant-delay enumeration after linear-time preprocessing and with constant-time updates. To derive our results, we improve and generalize previous correspondences between MATLANG and relational algebra evaluated over semiring-annotated relations. In addition, we identify properties on semirings that allow to generalize the complexity bounds for free-connex and q-hierarchical conjunctive queries from Boolean annotations to general semirings.

Routing Arena: A Benchmark Suite for Neural Routing Solvers

  • Authors: Authors: Daniela Thyssens, Tim Dernedde, Jonas K. Falkner, Lars Schmidt-Thieme
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2310.04140
  • Pdf link: https://arxiv.org/pdf/2310.04140
  • Abstract Neural Combinatorial Optimization has been researched actively in the last eight years. Even though many of the proposed Machine Learning based approaches are compared on the same datasets, the evaluation protocol exhibits essential flaws and the selection of baselines often neglects State-of-the-Art Operations Research approaches. To improve on both of these shortcomings, we propose the Routing Arena, a benchmark suite for Routing Problems that provides a seamless integration of consistent evaluation and the provision of baselines and benchmarks prevalent in the Machine Learning- and Operations Research field. The proposed evaluation protocol considers the two most important evaluation cases for different applications: First, the solution quality for an a priori fixed time budget and secondly the anytime performance of the respective methods. By setting the solution trajectory in perspective to a Best Known Solution and a Base Solver's solutions trajectory, we furthermore propose the Weighted Relative Average Performance (WRAP), a novel evaluation metric that quantifies the often claimed runtime efficiency of Neural Routing Solvers. A comprehensive first experimental evaluation demonstrates that the most recent Operations Research solvers generate state-of-the-art results in terms of solution quality and runtime efficiency when it comes to the vehicle routing problem. Nevertheless, some findings highlight the advantages of neural approaches and motivate a shift in how neural solvers should be conceptualized.

Amortized Network Intervention to Steer the Excitatory Point Processes

  • Authors: Authors: Zitao Song, Wendi Ren, Shuang Li
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2310.04159
  • Pdf link: https://arxiv.org/pdf/2310.04159
  • Abstract We tackle the challenge of large-scale network intervention for guiding excitatory point processes, such as infectious disease spread or traffic congestion control. Our model-based reinforcement learning utilizes neural ODEs to capture how the networked excitatory point processes will evolve subject to the time-varying changes in network topology. Our approach incorporates Gradient-Descent based Model Predictive Control (GD-MPC), offering policy flexibility to accommodate prior knowledge and constraints. To address the intricacies of planning and overcome the high dimensionality inherent to such decision-making problems, we design an Amortize Network Interventions (ANI) framework, allowing for the pooling of optimal policies from history and other contexts, while ensuring a permutation equivalent property. This property enables efficient knowledge transfer and sharing across diverse contexts. Our approach has broad applications, from curbing infectious disease spread to reducing carbon emissions through traffic light optimization, and thus has the potential to address critical societal and environmental challenges.

Light-LOAM: A Lightweight LiDAR Odometry and Mapping based on Graph-Matching

  • Authors: Authors: Shiquan Yi, Yang Lyu, Lin Hua, Quan Pan, Chunhui Zhao
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2310.04162
  • Pdf link: https://arxiv.org/pdf/2310.04162
  • Abstract Simultaneous Localization and Mapping (SLAM) plays an important role in robot autonomy. Reliability and efficiency are the two most valued features for applying SLAM in robot applications. In this paper, we consider achieving a reliable LiDAR-based SLAM function in computation-limited platforms, such as quadrotor UAVs based on graph-based point cloud association. First, contrary to most works selecting salient features for point cloud registration, we propose a non-conspicuous feature selection strategy for reliability and robustness purposes. Then a two-stage correspondence selection method is used to register the point cloud, which includes a KD-tree-based coarse matching followed by a graph-based matching method that uses geometric consistency to vote out incorrect correspondences. Additionally, we propose an odometry approach where the weight optimizations are guided by vote results from the aforementioned geometric consistency graph. In this way, the optimization of LiDAR odometry rapidly converges and evaluates a fairly accurate transformation resulting in the back-end module efficiently finishing the mapping task. Finally, we evaluate our proposed framework on the KITTI odometry dataset and real-world environments. Experiments show that our SLAM system achieves a comparative level or higher level of accuracy with more balanced computation efficiency compared with the mainstream LiDAR-based SLAM solutions.

Optimization with pattern-avoiding input

  • Authors: Authors: Benjamin Aram Berendsohn, László Kozma, Michal Opler
  • Subjects: Data Structures and Algorithms (cs.DS); Combinatorics (math.CO)
  • Arxiv link: https://arxiv.org/abs/2310.04236
  • Pdf link: https://arxiv.org/pdf/2310.04236
  • Abstract Permutation pattern-avoidance is a central concept of both enumerative and extremal combinatorics. In this paper we study the effect of permutation pattern-avoidance on the complexity of optimization problems. In the context of the dynamic optimality conjecture (Sleator, Tarjan, STOC 1983), Chalermsook, Goswami, Kozma, Mehlhorn, and Saranurak (FOCS 2015) conjectured that the amortized access cost of an optimal binary search tree (BST) is $O(1)$ whenever the access sequence avoids some fixed pattern. They showed a bound of $2^{\alpha{(n)}^{O(1)}}$, which was recently improved to $2^{\alpha{(n)}(1+o(1))}$ by Chalermsook, Pettie, and Yingchareonthawornchai (2023); here $n$ is the BST size and $\alpha(\cdot)$ the inverse-Ackermann function. In this paper we resolve the conjecture, showing a tight $O(1)$ bound. This indicates a barrier to dynamic optimality: any candidate online BST (e.g., splay trees or greedy trees) must match this optimum, but current analysis techniques only give superconstant bounds. More broadly, we argue that the easiness of pattern-avoiding input is a general phenomenon, not limited to BSTs or even to data structures. To illustrate this, we show that when the input avoids an arbitrary, fixed, a priori unknown pattern, one can efficiently compute a $k$-server solution of $n$ requests from a unit interval, with total cost $n^{O(1/\log k)}$, in contrast to the worst-case $\Theta(n/k)$ bound; and a traveling salesman tour of $n$ points from a unit box, of length $O(\log{n})$, in contrast to the worst-case $\Theta(\sqrt{n})$ bound; similar results hold for the euclidean minimum spanning tree, Steiner tree, and nearest-neighbor graphs. We show both results to be tight. Our techniques build on the Marcus-Tardos proof of the Stanley-Wilf conjecture, and on the recently emerging concept of twin-width; we believe our techniques to be more generally applicable.

Representative Days and Hours with Piecewise Linear Transitions for Power System Planning

  • Authors: Authors: Mojtaba Moradi-Sepahvand, Simon H. Tindemans
  • Subjects: Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2310.04239
  • Pdf link: https://arxiv.org/pdf/2310.04239
  • Abstract Electric demand and renewable power are highly variable, and the solution of a planning model relies on capturing this variability. This paper proposes a hybrid multi-area method that effectively captures both the intraday and interday chronology of real data considering extreme values, using a limited number of representative days, and time points within each day. An optimization-based representative extraction method is proposed to improve intraday chronology capturing. It ensures higher precision in preserving data chronology and extreme values than hierarchical clustering methods. The proposed method is based on a piecewise linear demand and supply representation, which reduces approximation errors compared to the traditional piecewise constant formulation. Additionally, sequentially linked day blocks with identical representatives, created through a mapping process, are employed for interday chronology capturing. To evaluate the efficiency of the proposed method, a comprehensive expansion co-planning model is developed, including transmission lines, energy storage systems, and wind farms.

On Solving Close Enough Orienteering Problem with Overlapped Neighborhoods

  • Authors: Authors: Qiuchen Qian, Yanran Wang, David Boyle
  • Subjects: Neural and Evolutionary Computing (cs.NE); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2310.04257
  • Pdf link: https://arxiv.org/pdf/2310.04257
  • Abstract The Close Enough Traveling Salesman Problem (CETSP) is a well-known variant of the classic Traveling Salesman Problem whereby the agent may complete its mission at any point within a target neighborhood. Heuristics based on overlapped neighborhoods, known as Steiner Zones (SZ), have gained attention in addressing CETSPs. While SZs offer effective approximations to the original graph, their inherent overlap imposes constraints on the search space, potentially conflicting with global optimization objectives. Here we present the Close Enough Orienteering Problem with Non-uniform Neighborhoods (CEOP-N), which extends CETSP by introducing variable prize attributes and non-uniform cost considerations for prize collection. To tackle CEOP-N, we develop a new approach featuring a Randomized Steiner Zone Discretization (RSZD) scheme coupled with a hybrid algorithm based on Particle Swarm Optimization (PSO) and Ant Colony System (ACS) - CRaSZe-AntS. The RSZD scheme identifies sub-regions for PSO exploration, and ACS determines the discrete visiting sequence. We evaluate the RSZD's discretization performance on CEOP instances derived from established CETSP instances, and compare CRaSZe-AntS against the most relevant state-of-the-art heuristic focused on single-neighborhood optimization for CEOP. We also compare the performance of the interior search within SZs and the boundary search on individual neighborhoods in the context of CEOP-N. Our results show CRaSZe-AntS can yield comparable solution quality with significantly reduced computation time compared to the single-neighborhood strategy, where we observe an averaged 140.44% increase in prize collection and 55.18% reduction of execution time. CRaSZe-AntS is thus highly effective in solving emerging CEOP-N, examples of which include truck-and-drone delivery scenarios.

Robust Losses for Decision-Focused Learning

  • Authors: Authors: Noah Schutte, Krzysztof Postek, Neil Yorke-Smith
  • Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
  • Arxiv link: https://arxiv.org/abs/2310.04328
  • Pdf link: https://arxiv.org/pdf/2310.04328
  • Abstract Optimization models used to make discrete decisions often contain uncertain parameters that are context-dependent and are estimated through prediction. To account for the quality of the decision made based on the prediction, decision-focused learning (end-to-end predict-then-optimize) aims at training the predictive model to minimize regret, i.e., the loss incurred by making a suboptimal decision. Despite the challenge of this loss function being possibly non-convex and in general non-differentiable, effective gradient-based learning approaches have been proposed to minimize the expected loss, using the empirical loss as a surrogate. However, empirical regret can be an ineffective surrogate because the uncertainty in the optimization model makes the empirical regret unequal to the expected regret in expectation. To illustrate the impact of this inequality, we evaluate the effect of aleatoric and epistemic uncertainty on the accuracy of empirical regret as a surrogate. Next, we propose three robust loss functions that more closely approximate expected regret. Experimental results show that training two state-of-the-art decision-focused learning approaches using robust regret losses improves test-sample empirical regret in general while keeping computational time equivalent relative to the number of training epochs.

Amortizing intractable inference in large language models

  • Authors: Authors: Edward J. Hu, Moksh Jain, Eric Elmoznino, Younesse Kaddar, Guillaume Lajoie, Yoshua Bengio, Nikolay Malkin
  • Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2310.04363
  • Pdf link: https://arxiv.org/pdf/2310.04363
  • Abstract Autoregressive large language models (LLMs) compress knowledge from their training data through next-token conditional distributions. This limits tractable querying of this knowledge to start-to-end autoregressive sampling. However, many tasks of interest -- including sequence continuation, infilling, and other forms of constrained generation -- involve sampling from intractable posterior distributions. We address this limitation by using amortized Bayesian inference to sample from these intractable posteriors. Such amortization is algorithmically achieved by fine-tuning LLMs via diversity-seeking reinforcement learning algorithms: generative flow networks (GFlowNets). We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training and reward-maximizing policy optimization. As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem and demonstrate that our approach enables data-efficient adaptation of LLMs to tasks that require multi-step rationalization and tool use.

Confronting Reward Model Overoptimization with Constrained RLHF

  • Authors: Authors: Ted Moskovitz, Aaditya K. Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D. Dragan, Stephen McAleer
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2310.04373
  • Pdf link: https://arxiv.org/pdf/2310.04373
  • Abstract Large language models are typically aligned with human preferences by optimizing $\textit{reward models}$ (RMs) fitted to human feedback. However, human preferences are multi-faceted, and it is increasingly common to derive reward from a composition of simpler reward models which each capture a different aspect of language quality. This itself presents a challenge, as it is difficult to appropriately weight these component RMs when combining them. Compounding this difficulty, because any RM is only a proxy for human evaluation, this process is vulnerable to $\textit{overoptimization}$, wherein past a certain point, accumulating higher reward is associated with worse human ratings. In this paper, we perform, to our knowledge, the first study on overoptimization in composite RMs, showing that correlation between component RMs has a significant effect on the locations of these points. We then introduce an approach to solve this issue using constrained reinforcement learning as a means of preventing the agent from exceeding each RM's threshold of usefulness. Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally given by the Lagrange multipliers. As a result, each RM stays within the range at which it is an effective proxy, improving evaluation performance. Finally, we introduce an adaptive method using gradient-free optimization to identify and optimize towards these points during a single run.

Why Do We Need Weight Decay in Modern Deep Learning?

  • Authors: Authors: Maksym Andriushchenko, Francesco D'Angelo, Aditya Varre, Nicolas Flammarion
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2310.04415
  • Pdf link: https://arxiv.org/pdf/2310.04415
  • Abstract Weight decay is a broadly used technique for training state-of-the-art deep networks, including large language models. Despite its widespread usage, its role remains poorly understood. In this work, we highlight that the role of weight decay in modern deep learning is different from its regularization effect studied in classical learning theory. For overparameterized deep networks, we show how weight decay modifies the optimization dynamics enhancing the ever-present implicit regularization of SGD via the loss stabilization mechanism. In contrast, for underparameterized large language models trained with nearly online SGD, we describe how weight decay balances the bias-variance tradeoff in stochastic optimization leading to lower training loss. Moreover, we show that weight decay also prevents sudden loss divergences for bfloat16 mixed-precision training which is a crucial tool for LLM training. Overall, we present a unifying perspective from ResNets on vision tasks to LLMs: weight decay is never useful as an explicit regularizer but instead changes the training dynamics in a desirable way. Our code is available at https://github.com/tml-epfl/why-weight-decay.

Keyword: adam

There is no result

Keyword: gradient

Lightweight Boosting Models for User Response Prediction Using Adversarial Validation

  • Authors: Authors: Hyeonwoo Kim, Wonsung Lee
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2310.03778
  • Pdf link: https://arxiv.org/pdf/2310.03778
  • Abstract The ACM RecSys Challenge 2023, organized by ShareChat, aims to predict the probability of the app being installed. This paper describes the lightweight solution to this challenge. We formulate the task as a user response prediction task. For rapid prototyping for the task, we propose a lightweight solution including the following steps: 1) using adversarial validation, we effectively eliminate uninformative features from a dataset; 2) to address noisy continuous features and categorical features with a large number of unique values, we employ feature engineering techniques.; 3) we leverage Gradient Boosted Decision Trees (GBDT) for their exceptional performance and scalability. The experiments show that a single LightGBM model, without additional ensembling, performs quite well. Our team achieved ninth place in the challenge with the final leaderboard score of 6.059065. Code for our approach can be found here: https://github.com/choco9966/recsys-challenge-2023.

Small batch deep reinforcement learning

  • Authors: Authors: Johan Obando-Ceron, Marc G. Bellemare, Pablo Samuel Castro
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2310.03882
  • Pdf link: https://arxiv.org/pdf/2310.03882
  • Abstract In value-based deep reinforcement learning with replay memories, the batch size parameter specifies how many transitions to sample for each gradient update. Although critical to the learning process, this value is typically not adjusted when proposing new algorithms. In this work we present a broad empirical study that suggests {\em reducing} the batch size can result in a number of significant performance gains; this is surprising, as the general tendency when training neural networks is towards larger batch sizes for improved performance. We complement our experimental findings with a set of empirical analyses towards better understanding this phenomenon.

Accelerated Neural Network Training with Rooted Logistic Objectives

  • Authors: Authors: Zhu Wang, Praveen Raj Veluswami, Harsh Mishra, Sathya N. Ravi
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2310.03890
  • Pdf link: https://arxiv.org/pdf/2310.03890
  • Abstract Many neural networks deployed in the real world scenarios are trained using cross entropy based loss functions. From the optimization perspective, it is known that the behavior of first order methods such as gradient descent crucially depend on the separability of datasets. In fact, even in the most simplest case of binary classification, the rate of convergence depends on two factors: (1) condition number of data matrix, and (2) separability of the dataset. With no further pre-processing techniques such as over-parametrization, data augmentation etc., separability is an intrinsic quantity of the data distribution under consideration. We focus on the landscape design of the logistic function and derive a novel sequence of {\em strictly} convex functions that are at least as strict as logistic loss. The minimizers of these functions coincide with those of the minimum norm solution wherever possible. The strict convexity of the derived function can be extended to finetune state-of-the-art models and applications. In empirical experimental analysis, we apply our proposed rooted logistic objective to multiple deep models, e.g., fully-connected neural networks and transformers, on various of classification benchmarks. Our results illustrate that training with rooted loss function is converged faster and gains performance improvements. Furthermore, we illustrate applications of our novel rooted loss function in generative modeling based downstream applications, such as finetuning StyleGAN model with the rooted loss. The code implementing our losses and models can be found here for open source software development purposes: https://anonymous.4open.science/r/rooted_loss.

Gradient Descent Provably Solves Nonlinear Tomographic Reconstruction

  • Authors: Authors: Sara Fridovich-Keil, Fabrizio Valdivia, Gordon Wetzstein, Benjamin Recht, Mahdi Soltanolkotabi
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Optimization and Control (math.OC); Medical Physics (physics.med-ph)
  • Arxiv link: https://arxiv.org/abs/2310.03956
  • Pdf link: https://arxiv.org/pdf/2310.03956
  • Abstract In computed tomography (CT), the forward model consists of a linear Radon transform followed by an exponential nonlinearity based on the attenuation of light according to the Beer-Lambert Law. Conventional reconstruction often involves inverting this nonlinearity as a preprocessing step and then solving a convex inverse problem. However, this nonlinear measurement preprocessing required to use the Radon transform is poorly conditioned in the vicinity of high-density materials, such as metal. This preprocessing makes CT reconstruction methods numerically sensitive and susceptible to artifacts near high-density regions. In this paper, we study a technique where the signal is directly reconstructed from raw measurements through the nonlinear forward model. Though this optimization is nonconvex, we show that gradient descent provably converges to the global optimum at a geometric rate, perfectly reconstructing the underlying signal with a near minimal number of random measurements. We also prove similar results in the under-determined setting where the number of measurements is significantly smaller than the dimension of the signal. This is achieved by enforcing prior structural information about the signal through constraints on the optimization variables. We illustrate the benefits of direct nonlinear CT reconstruction with cone-beam CT experiments on synthetic and real 3D volumes. We show that this approach reduces metal artifacts compared to a commercial reconstruction of a human skull with metal dental crowns.

Amortized Network Intervention to Steer the Excitatory Point Processes

  • Authors: Authors: Zitao Song, Wendi Ren, Shuang Li
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2310.04159
  • Pdf link: https://arxiv.org/pdf/2310.04159
  • Abstract We tackle the challenge of large-scale network intervention for guiding excitatory point processes, such as infectious disease spread or traffic congestion control. Our model-based reinforcement learning utilizes neural ODEs to capture how the networked excitatory point processes will evolve subject to the time-varying changes in network topology. Our approach incorporates Gradient-Descent based Model Predictive Control (GD-MPC), offering policy flexibility to accommodate prior knowledge and constraints. To address the intricacies of planning and overcome the high dimensionality inherent to such decision-making problems, we design an Amortize Network Interventions (ANI) framework, allowing for the pooling of optimal policies from history and other contexts, while ensuring a permutation equivalent property. This property enables efficient knowledge transfer and sharing across diverse contexts. Our approach has broad applications, from curbing infectious disease spread to reducing carbon emissions through traffic light optimization, and thus has the potential to address critical societal and environmental challenges.

Robust Losses for Decision-Focused Learning

  • Authors: Authors: Noah Schutte, Krzysztof Postek, Neil Yorke-Smith
  • Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
  • Arxiv link: https://arxiv.org/abs/2310.04328
  • Pdf link: https://arxiv.org/pdf/2310.04328
  • Abstract Optimization models used to make discrete decisions often contain uncertain parameters that are context-dependent and are estimated through prediction. To account for the quality of the decision made based on the prediction, decision-focused learning (end-to-end predict-then-optimize) aims at training the predictive model to minimize regret, i.e., the loss incurred by making a suboptimal decision. Despite the challenge of this loss function being possibly non-convex and in general non-differentiable, effective gradient-based learning approaches have been proposed to minimize the expected loss, using the empirical loss as a surrogate. However, empirical regret can be an ineffective surrogate because the uncertainty in the optimization model makes the empirical regret unequal to the expected regret in expectation. To illustrate the impact of this inequality, we evaluate the effect of aleatoric and epistemic uncertainty on the accuracy of empirical regret as a surrogate. Next, we propose three robust loss functions that more closely approximate expected regret. Experimental results show that training two state-of-the-art decision-focused learning approaches using robust regret losses improves test-sample empirical regret in general while keeping computational time equivalent relative to the number of training epochs.

Confronting Reward Model Overoptimization with Constrained RLHF

  • Authors: Authors: Ted Moskovitz, Aaditya K. Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca D. Dragan, Stephen McAleer
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2310.04373
  • Pdf link: https://arxiv.org/pdf/2310.04373
  • Abstract Large language models are typically aligned with human preferences by optimizing $\textit{reward models}$ (RMs) fitted to human feedback. However, human preferences are multi-faceted, and it is increasingly common to derive reward from a composition of simpler reward models which each capture a different aspect of language quality. This itself presents a challenge, as it is difficult to appropriately weight these component RMs when combining them. Compounding this difficulty, because any RM is only a proxy for human evaluation, this process is vulnerable to $\textit{overoptimization}$, wherein past a certain point, accumulating higher reward is associated with worse human ratings. In this paper, we perform, to our knowledge, the first study on overoptimization in composite RMs, showing that correlation between component RMs has a significant effect on the locations of these points. We then introduce an approach to solve this issue using constrained reinforcement learning as a means of preventing the agent from exceeding each RM's threshold of usefulness. Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally given by the Lagrange multipliers. As a result, each RM stays within the range at which it is an effective proxy, improving evaluation performance. Finally, we introduce an adaptive method using gradient-free optimization to identify and optimize towards these points during a single run.

Policy-Gradient Training of Language Models for Ranking

  • Authors: Authors: Ge Gao, Jonathan D. Chang, Claire Cardie, Kianté Brantley, Thorsten Joachim
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2310.04407
  • Pdf link: https://arxiv.org/pdf/2310.04407
  • Abstract Text retrieval plays a crucial role in incorporating factual knowledge for decision making into language processing pipelines, ranging from chat-based web search to question answering systems. Current state-of-the-art text retrieval models leverage pre-trained large language models (LLMs) to achieve competitive performance, but training LLM-based retrievers via typical contrastive losses requires intricate heuristics, including selecting hard negatives and using additional supervision as learning signals. This reliance on heuristics stems from the fact that the contrastive loss itself is heuristic and does not directly optimize the downstream metrics of decision quality at the end of the processing pipeline. To address this issue, we introduce Neural PG-RANK, a novel training algorithm that learns to rank by instantiating a LLM as a Plackett-Luce ranking policy. Neural PG-RANK provides a principled method for end-to-end training of retrieval models as part of larger decision systems via policy gradient, with little reliance on complex heuristics, and it effectively unifies the training objective with downstream decision-making quality. We conduct extensive experiments on various text retrieval benchmarks. The results demonstrate that when the training objective aligns with the evaluation setup, Neural PG-RANK yields remarkable in-domain performance improvement, with substantial out-of-domain generalization to some critical datasets employed in downstream question answering tasks.

Keyword: super-resolution

Degradation-Aware Self-Attention Based Transformer for Blind Image Super-Resolution

  • Authors: Authors: Qingguo Liu, Pan Gao, Kang Han, Ningzhong Liu, Wei Xiang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2310.04180
  • Pdf link: https://arxiv.org/pdf/2310.04180
  • Abstract Compared to CNN-based methods, Transformer-based methods achieve impressive image restoration outcomes due to their abilities to model remote dependencies. However, how to apply Transformer-based methods to the field of blind super-resolution (SR) and further make an SR network adaptive to degradation information is still an open problem. In this paper, we propose a new degradation-aware self-attention-based Transformer model, where we incorporate contrastive learning into the Transformer network for learning the degradation representations of input images with unknown noise. In particular, we integrate both CNN and Transformer components into the SR network, where we first use the CNN modulated by the degradation information to extract local features, and then employ the degradation-aware Transformer to extract global semantic features. We apply our proposed model to several popular large-scale benchmark datasets for testing, and achieve the state-of-the-art performance compared to existing methods. In particular, our method yields a PSNR of 32.43 dB on the Urban100 dataset at $\times$2 scale, 0.94 dB higher than DASR, and 26.62 dB on the Urban100 dataset at $\times$4 scale, 0.26 dB improvement over KDSR, setting a new benchmark in this area. Source code is available at: https://github.com/I2-Multimedia-Lab/DSAT/tree/main.

zoq avatar Oct 09 '23 06:10 zoq