New submissions for Fri, 17 Nov 23

Open zoq opened this issue 1 year ago • 0 comments

Keyword: sgd

There is no result

Keyword: optimization

Neural Packing: from Visual Sensing to Reinforcement Learning

  • Authors: Juzhan Xu, Minglun Gong, Hao Zhang, Hui Huang, Ruizhen Hu
  • Subjects: Machine Learning (cs.LG); Graphics (cs.GR); Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2311.09233
  • Pdf link: https://arxiv.org/pdf/2311.09233
  • Abstract We present a novel learning framework to solve the transport-and-packing (TAP) problem in 3D. It constitutes a full solution pipeline from partial observations of input objects via RGBD sensing and recognition to final box placement, via robotic motion planning, to arrive at a compact packing in a target container. The technical core of our method is a neural network for TAP, trained via reinforcement learning (RL), to solve the NP-hard combinatorial optimization problem. Our network simultaneously selects an object to pack and determines the final packing location, based on a judicious encoding of the continuously evolving states of partially observed source objects and available spaces in the target container, using separate encoders both enabled with attention mechanisms. The encoded feature vectors are employed to compute the matching scores and feasibility masks of different pairings of box selection and available space configuration for packing strategy optimization. Extensive experiments, including ablation studies and physical packing execution by a real robot (Universal Robot UR5e), are conducted to evaluate our method in terms of its design choices, scalability, generalizability, and comparisons to baselines, including the most recent RL-based TAP solution. We also contribute the first benchmark for TAP which covers a variety of input settings and difficulty levels.

Toward Ultra-Low-Power Remote Health Monitoring: An Optimal and Adaptive Compressed Sensing Framework for Activity Recognition

  • Authors: J. Pagan, R. Fallahzadeh, M. Pedram, José L. Risco-Martín, J. M. Moya, J. L. Ayala, H. Ghasemzadeh
  • Subjects: Networking and Internet Architecture (cs.NI)
  • Arxiv link: https://arxiv.org/abs/2311.09238
  • Pdf link: https://arxiv.org/pdf/2311.09238
  • Abstract Activity recognition, as an important component of behavioral monitoring and intervention, has attracted enormous attention, especially in Mobile Cloud Computing (MCC) and Remote Health Monitoring (RHM) paradigms. While resource-constrained wearable devices have recently been gaining popularity, their battery life is limited and constrained by the frequent wireless transmission of data to more computationally powerful back-ends. This paper proposes an ultra-low power activity recognition system using a novel adaptive compressed sensing technique that aims to minimize transmission costs. Coarse-grained on-body sensor localization and unsupervised clustering modules are devised to autonomously reconfigure the compressed sensing module for further power saving. We perform a thorough heuristic optimization using Grammatical Evolution (GE) to ensure minimal computation overhead of the proposed methodology. Our evaluation on a real-world dataset and a low power wearable sensing node demonstrates that our approach can reduce the energy consumption of the wireless data transmission by up to 81.2% and 61.5%, with up to 60.6% and 35.0% overall power savings in comparison with a baseline and a naive state-of-the-art approach, respectively. These solutions lead to an average activity recognition accuracy of 89.0% (only 4.8% less than the baseline accuracy) while having a negligible energy overhead of on-node computation.

Affine Invariance in Continuous-Domain Convolutional Neural Networks

  • Authors: Ali Mohaddes, Johannes Lederer
  • Subjects: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2311.09245
  • Pdf link: https://arxiv.org/pdf/2311.09245
  • Abstract The notion of group invariance helps neural networks in recognizing patterns and features under geometric transformations. Indeed, it has been shown that group invariance can largely improve deep learning performance in practice, where such transformations are very common. This research studies affine invariance on continuous-domain convolutional neural networks. Unlike other research that considers isometric invariance or similarity invariance, we focus on the full structure of affine transforms generated by the general linear group $\mathrm{GL}_2(\mathbb{R})$. We introduce a new criterion to assess the similarity of two input signals under affine transformations. Then, unlike conventional methods that involve solving complex optimization problems on the Lie group $G_2$, we analyze the convolution of lifted signals and compute the corresponding integration over $G_2$. In sum, our research could eventually extend the scope of geometrical transformations that practical deep-learning pipelines can handle.

Pinpoint, Not Criticize: Refining Large Language Models via Fine-Grained Actionable Feedback

  • Authors: Wenda Xu, Daniel Deutsch, Mara Finkelstein, Juraj Juraska, Biao Zhang, Zhongtao Liu, William Yang Wang, Lei Li, Markus Freitag
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2311.09336
  • Pdf link: https://arxiv.org/pdf/2311.09336
  • Abstract Recent improvements in text generation have leveraged human feedback to improve the quality of the generated output. However, human feedback is not always available, especially during inference. In this work, we propose an inference time optimization method FITO to use fine-grained actionable feedback in the form of error type, error location and severity level that are predicted by a learned error pinpoint model for iterative refinement. FITO starts with an initial output, then iteratively incorporates the feedback via a refinement model that generates an improved output conditioned on the feedback. Given the uncertainty of consistent refined samples at iterative steps, we formulate iterative refinement into a local search problem and develop a simulated annealing based algorithm that balances exploration of the search space and optimization for output quality. We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA) and topical summarization. We observe 0.8 and 0.7 MetricX gain on Chinese-English and English-German translation, 4.5 and 1.8 ROUGE-L gain at long form QA and topic summarization respectively, with a single iteration of refinement. With our simulated annealing algorithm, we see further quality improvements, including up to 1.7 MetricX improvements over the baseline approach.
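The local-search idea in this abstract can be illustrated with a generic simulated annealing loop. This is a minimal sketch only, not the FITO algorithm: the names `propose` and `score` are hypothetical stand-ins for a refinement model and a quality metric, and the geometric cooling schedule is an assumption, since the abstract does not specify one.

```python
import math
import random


def simulated_annealing(initial, propose, score, steps=50, t0=1.0, cooling=0.9, seed=0):
    """Generic simulated annealing that maximizes score(candidate).

    propose(current, rng) returns a neighbouring candidate (hypothetical
    stand-in for a refinement step); score(x) returns a quality value
    (hypothetical stand-in for an automatic metric).
    """
    rng = random.Random(seed)
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    temperature = t0
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse candidates with a
        # probability that shrinks as the temperature drops (exploration).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
        # Assumed geometric cooling; floor avoids division by zero.
        temperature = max(temperature * cooling, 1e-8)
    return best, best_score


if __name__ == "__main__":
    # Toy problem: walk along the integers to maximize -(x - 3)^2.
    toy_score = lambda x: -(x - 3) ** 2
    toy_propose = lambda x, rng: x + rng.choice([-1, 1])
    print(simulated_annealing(0, toy_propose, toy_score))
```

Early on, the high temperature lets the search accept worse candidates and explore; as it cools, the loop behaves more like greedy hill climbing, which is the exploration/optimization balance the abstract refers to.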

A Software-Hardware Co-Optimized Toolkit for Deep Reinforcement Learning on Heterogeneous Platforms

  • Authors: Yuan Meng, Michael Kinsner, Deshanand Singh, Mahesh A Iyer, Viktor Prasanna
  • Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
  • Arxiv link: https://arxiv.org/abs/2311.09445
  • Pdf link: https://arxiv.org/pdf/2311.09445
  • Abstract Deep Reinforcement Learning (DRL) is vital in various AI applications. DRL algorithms comprise diverse compute kernels, which may not be simultaneously optimized using a homogeneous architecture. However, even with available heterogeneous architectures, optimizing DRL performance remains a challenge due to the complexity of hardware and programming models employed in modern data centers. To address this, we introduce PEARL, a toolkit for composing parallel DRL systems on heterogeneous platforms consisting of general-purpose processors (CPUs) and accelerators (GPUs, FPGAs). Our innovations include: 1. A general training protocol agnostic of the underlying hardware, enabling portable implementations across various processors and accelerators. 2. Incorporation of DRL-specific scheduling optimizations within the protocol, facilitating parallelized training and enhancing the overall system performance. 3. High-level API for productive development using the toolkit. 4. Automatic optimization of DRL task-to-device assignments through performance estimation, supporting various optimization metrics including throughput and power efficiency. We showcase our toolkit through experimentation with two widely used DRL algorithms, DQN and DDPG, on two diverse heterogeneous platforms. The generated implementations outperform state-of-the-art libraries for CPU-GPU platforms by throughput improvements of up to 2.1× and power efficiency improvements of up to 3.4×.

DeepMartNet -- A Martingale Based Deep Neural Network Learning Method for Dirichlet BVP and Eigenvalue Problems of Elliptic PDEs

  • Authors: Wei Cai, Andrew He, Daniel Margolis
  • Subjects: Numerical Analysis (math.NA)
  • Arxiv link: https://arxiv.org/abs/2311.09456
  • Pdf link: https://arxiv.org/pdf/2311.09456
  • Abstract In this paper, we propose DeepMartNet - a Martingale based deep neural network learning method for solving Dirichlet boundary value problems (BVPs) and eigenvalue problems for elliptic partial differential equations (PDEs) in high dimensions. The method is based on Varadhan's Martingale problem formulation for the BVP/eigenvalue problems where a loss function enforcing the Martingale property for the PDE solution is used for efficient optimization by sampling the stochastic processes associated with elliptic operators. High dimensional numerical results for BVPs of the Poisson-Boltzmann equation and eigenvalue problems of a Fokker-Planck equation demonstrate the capability of the proposed DeepMartNet learning method for solving high dimensional PDE problems.

SparseAuto: An Auto-Scheduler for Sparse Tensor Computations Using Recursive Loop Nest Restructuring

  • Authors: Adhitha Dias, Logan Anderson, Kirshanthan Sundararajah, Artem Pelenitsyn, Milind Kulkarni
  • Subjects: Programming Languages (cs.PL)
  • Arxiv link: https://arxiv.org/abs/2311.09549
  • Pdf link: https://arxiv.org/pdf/2311.09549
  • Abstract Automated code generation and performance optimizations for sparse tensor algebra are cardinal since they have become essential in many real-world applications like quantum computing, physics, chemistry, and machine learning. General sparse tensor algebra compilers are not always versatile enough to generate asymptotically optimal code for sparse tensor contractions. This paper shows how to optimize and generate asymptotically better schedules for complex tensor expressions using kernel fission and fusion. We present a generalized loop transformation to achieve loop nesting for minimized memory footprint and reduced asymptotic complexity. Furthermore, we present an auto-scheduler that uses a partially ordered set-based cost model that uses both time and auxiliary memory complexities in its pruning stages. In addition, we highlight the use of SMT solvers in sparse auto-schedulers to prune the Pareto frontier of schedules to the smallest number of possible schedules with user-defined constraints available at compile time. Finally, we show that our auto-scheduler can select asymptotically better schedules that use our compiler transformation to generate optimized code. Our results show that the auto-scheduler achieves orders of magnitude speedup compared to the TACO-generated code for several real-world tensor algebra computations on different real-world inputs.

Accelerating material discovery with a threshold-driven hybrid acquisition policy-based Bayesian optimization

  • Authors: Ahmed Shoyeb Raihan, Hamed Khosravi, Srinjoy Das, Imtiaz Ahmed
  • Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Machine Learning (stat.ML)
  • Arxiv link: https://arxiv.org/abs/2311.09591
  • Pdf link: https://arxiv.org/pdf/2311.09591
  • Abstract Advancements in materials play a crucial role in technological progress. However, the process of discovering and developing materials with desired properties is often impeded by substantial experimental costs, extensive resource utilization, and lengthy development periods. To address these challenges, modern approaches often employ machine learning (ML) techniques such as Bayesian Optimization (BO), which streamline the search for optimal materials by iteratively selecting experiments that are most likely to yield beneficial results. However, traditional BO methods, while beneficial, often struggle with balancing the trade-off between exploration and exploitation, leading to sub-optimal performance in material discovery processes. This paper introduces a novel Threshold-Driven UCB-EI Bayesian Optimization (TDUE-BO) method, which dynamically integrates the strengths of Upper Confidence Bound (UCB) and Expected Improvement (EI) acquisition functions to optimize the material discovery process. Unlike classical BO, our method focuses on efficiently navigating the high-dimensional material design space (MDS). TDUE-BO begins with an exploration-focused UCB approach, ensuring a comprehensive initial sweep of the MDS. As the model gains confidence, indicated by reduced uncertainty, it transitions to the more exploitative EI method, focusing on promising areas identified earlier. The UCB-to-EI switching policy, guided by continuous monitoring of the model uncertainty during each step of sequential sampling, navigates the MDS more efficiently while ensuring rapid convergence. The effectiveness of TDUE-BO is demonstrated through its application on three different material datasets, showing significantly better approximation and optimization performance over the EI and UCB-based BO methods in terms of the RMSE scores and convergence efficiency, respectively.
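The UCB-to-EI switch described in this abstract can be sketched with the standard textbook forms of the two acquisition functions. This is an illustration under stated assumptions, not the authors' implementation: the switching rule (mean predictive standard deviation compared against a fixed `sigma_threshold`), the parameter defaults, and all function names are hypothetical, since the abstract does not give the actual rule.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ucb(mu, sigma, kappa=2.0):
    """Upper Confidence Bound for maximization: favours uncertain points."""
    return mu + kappa * sigma


def expected_improvement(mu, sigma, best, xi=0.0):
    """Expected Improvement over the best observed value (maximization)."""
    if sigma <= 0.0:
        return max(mu - best - xi, 0.0)
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * normal_cdf(z) + sigma * normal_pdf(z)


def select_next(candidates, best, sigma_threshold=0.1):
    """Pick the next candidate index from a list of (mu, sigma) predictions.

    Assumed switching rule: use UCB while the mean predictive standard
    deviation is above sigma_threshold, and EI once it falls to or below it.
    """
    mean_sigma = sum(s for _, s in candidates) / len(candidates)
    if mean_sigma > sigma_threshold:
        policy = "UCB"
        scores = [ucb(m, s) for m, s in candidates]
    else:
        policy = "EI"
        scores = [expected_improvement(m, s, best) for m, s in candidates]
    index = max(range(len(scores)), key=scores.__getitem__)
    return index, policy


if __name__ == "__main__":
    print(select_next([(0.5, 1.0), (0.9, 0.2)], best=0.8))    # uncertain model
    print(select_next([(0.5, 0.05), (0.9, 0.02)], best=0.8))  # confident model
```

In a real BO loop the `(mu, sigma)` pairs would come from a surrogate model such as a Gaussian process; they are passed in directly here to keep the sketch self-contained.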

On Retrieval Augmentation and the Limitations of Language Model Training

  • Authors: Ting-Rui Chiang, Xinyan Velocity Yu, Joshua Robinson, Ollie Liu, Isabelle Lee, Dani Yogatama
  • Subjects: Computation and Language (cs.CL)
  • Arxiv link: https://arxiv.org/abs/2311.09615
  • Pdf link: https://arxiv.org/pdf/2311.09615
  • Abstract Augmenting a language model (LM) with $k$-nearest neighbors (kNN) retrieval on its training data alone can decrease its perplexity, though the underlying reasons for this remain elusive. In this work, we first rule out one previously posited possibility -- the "softmax bottleneck." We further identify the MLP hurdle phenomenon, where the final MLP layer in LMs may impede LM optimization early on. We explore memorization and generalization in language models with two new datasets, where advanced models like GPT-3.5-turbo find generalizing to irrelevant information in the training data challenging. However, incorporating kNN retrieval into vanilla GPT-2 117M can consistently improve performance in this setting.

Reconstructing Continuous Light Field From Single Coded Image

  • Authors: Yuya Ishikawa, Keita Takahashi, Chihiro Tsutake, Toshiaki Fujii
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Image and Video Processing (eess.IV)
  • Arxiv link: https://arxiv.org/abs/2311.09646
  • Pdf link: https://arxiv.org/pdf/2311.09646
  • Abstract We propose a method for reconstructing a continuous light field of a target scene from a single observed image. Our method takes the best of two worlds: joint aperture-exposure coding for compressive light-field acquisition, and a neural radiance field (NeRF) for view synthesis. Joint aperture-exposure coding implemented in a camera enables effective embedding of 3-D scene information into an observed image, but in previous works, it was used only for reconstructing discretized light-field views. NeRF-based neural rendering enables high quality view synthesis of a 3-D scene from continuous viewpoints, but when only a single image is given as the input, it struggles to achieve satisfactory quality. Our method integrates these two techniques into an efficient and end-to-end trainable pipeline. Trained on a wide variety of scenes, our method can reconstruct continuous light fields accurately and efficiently without any test time optimization. To our knowledge, this is the first work to bridge two worlds: camera design for efficiently acquiring 3-D information and neural rendering.

Do Physicians Know How to Prompt? The Need for Automatic Prompt Optimization Help in Clinical Note Generation

  • Authors: Zonghai Yao, Ahmed Jaafar, Beining Wang, Yue Zhu, Zhichao Yang, Hong Yu
  • Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2311.09684
  • Pdf link: https://arxiv.org/pdf/2311.09684
  • Abstract This study examines the effect of prompt engineering on the performance of Large Language Models (LLMs) in clinical note generation. We introduce an Automatic Prompt Optimization (APO) framework to refine initial prompts and compare the outputs of medical experts, non-medical experts, and APO-enhanced GPT3.5 and GPT4. Results highlight GPT4 APO's superior performance in standardizing prompt quality across clinical note sections. A human-in-the-loop approach shows that experts maintain content quality post-APO, with a preference for their own modifications, suggesting the value of expert customization. We recommend a two-phase optimization process, leveraging APO-GPT4 for consistency and expert input for personalization.

CDMPP: A Device-Model Agnostic Framework for Latency Prediction of Tensor Programs

  • Authors: Hanpeng Hu, Junwei Su, Juntao Zhao, Yanghua Peng, Yibo Zhu, Haibin Lin, Chuan Wu
  • Subjects: Machine Learning (cs.LG); Performance (cs.PF)
  • Arxiv link: https://arxiv.org/abs/2311.09690
  • Pdf link: https://arxiv.org/pdf/2311.09690
  • Abstract Deep Neural Networks (DNNs) have shown excellent performance in a wide range of machine learning applications. Knowing the latency of running a DNN model or tensor program on a specific device is useful in various tasks, such as DNN graph- or tensor-level optimization and device selection. Considering the large space of DNN models and devices that impede direct profiling of all combinations, recent efforts focus on building a predictor to model the performance of DNN models on different devices. However, none of the existing attempts have achieved a cost model that can accurately predict the performance of various tensor programs while supporting both training and inference accelerators. We propose CDMPP, an efficient tensor program latency prediction framework for both cross-model and cross-device prediction. We design an informative but efficient representation of tensor programs, called compact ASTs, and a pre-order-based positional encoding method, to capture the internal structure of tensor programs. We develop a domain-adaption-inspired method to learn domain-invariant representations and devise a KMeans-based sampling algorithm, for the predictor to learn from different domains (i.e., different DNN operators and devices). Our extensive experiments on a diverse range of DNN models and devices demonstrate that CDMPP significantly outperforms state-of-the-art baselines with 14.03% and 10.85% prediction error for cross-model and cross-device prediction, respectively, and one order of magnitude higher training efficiency. The implementation and the expanded dataset are available at https://github.com/joapolarbear/cdmpp.

GEO: Generative Engine Optimization

  • Authors: Pranjal Aggarwal, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, Karthik R Narasimhan, Ameet Deshpande
  • Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
  • Arxiv link: https://arxiv.org/abs/2311.09735
  • Pdf link: https://arxiv.org/pdf/2311.09735
  • Abstract The advent of large language models (LLMs) has ushered in a new paradigm of search engines that use generative models to gather and summarize information to answer user queries. This emerging technology, which we formalize under the unified framework of Generative Engines (GEs), has the potential to generate accurate and personalized responses, and is rapidly replacing traditional search engines like Google and Bing. Generative Engines typically satisfy queries by synthesizing information from multiple sources and summarizing them with the help of LLMs. While this shift significantly improves user utility and generative search engine traffic, it results in a huge challenge for the third stakeholder -- website and content creators. Given the black-box and fast-moving nature of Generative Engines, content creators have little to no control over when and how their content is displayed. With generative engines here to stay, the right tools should be provided to ensure that the creator economy is not severely disadvantaged. To address this, we introduce Generative Engine Optimization (GEO), a novel paradigm to aid content creators in improving the visibility of their content in Generative Engine responses through a black-box optimization framework for optimizing and defining visibility metrics. We facilitate systematic evaluation in this new paradigm by introducing GEO-bench, a benchmark of diverse user queries across multiple domains, coupled with sources required to answer these queries. Through rigorous evaluation, we show that GEO can boost visibility by up to 40% in generative engine responses. Moreover, we show the efficacy of these strategies varies across domains, underscoring the need for domain-specific methods. Our work opens a new frontier in the field of information discovery systems, with profound implications for generative engines and content creators.

Low-cost singular value decomposition with optimal sensor placement

  • Authors: Ashton Hetherington, Soledad Le Clainche
  • Subjects: Computational Engineering, Finance, and Science (cs.CE)
  • Arxiv link: https://arxiv.org/abs/2311.09791
  • Pdf link: https://arxiv.org/pdf/2311.09791
  • Abstract This paper presents a new method capable of reconstructing datasets with great precision and very low computational cost using a novel variant of the singular value decomposition (SVD) algorithm that has been named low-cost SVD (lcSVD). This algorithm reconstructs a dataset from a minimal number of points, which can be selected randomly, equidistantly, or calculated using the optimal sensor placement functionality also presented in this paper, which seeks to minimize the reconstruction error to validate the calculated sensor positions. The method can also find the optimal number of sensors, aiding users in optimizing experimental data collection. The method is tested on a series of datasets spanning experimental and numerical data, two- and three-dimensional data, and laminar and turbulent flow, which demonstrate its high reconstruction accuracy, robustness, and computational resource optimization. Maximum speed-up factors of 630 and memory reduction of 37% are found when compared to the application of standard SVD to the dataset.

Load Data Valuation in Multi-Energy Systems: An End-to-End Approach

  • Authors: Yangze Zhou, Qingsong Wen, Jie Song, Xueyuan Cui, Yi Wang
  • Subjects: Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2311.09839
  • Pdf link: https://arxiv.org/pdf/2311.09839
  • Abstract Accurate load forecasting serves as the foundation for the flexible operation of multi-energy systems (MES). Multi-energy loads are tightly coupled and exhibit significant uncertainties. Many works focus on enhancing forecasting accuracy by leveraging cross-sector information. However, data owners may not be motivated to share their data unless it leads to substantial benefits. Ensuring a reasonable data valuation can encourage them to share their data willingly. This paper presents an end-to-end framework to quantify multi-energy load data value by integrating forecasting and decision processes. To address optimization problems with integer variables, a two-stage end-to-end model solution is proposed. Moreover, a profit allocation strategy based on contribution to cost savings is investigated to encourage data sharing in MES. The experimental results demonstrate a significant decrease in operation costs, suggesting that the proposed valuation approach more effectively extracts the inherent data value than traditional methods. According to the proposed incentive mechanism, all sectors can benefit from data sharing by improving forecasting accuracy or receiving economic compensation.

Semantic-Relay-Aided Text Transmission: Placement Optimization and Bandwidth Allocation

  • Authors: Tianyu Liu, Changsheng You, Zeyang Hu, Chenyu Wu, Yi Gong, Kaibin Huang
  • Subjects: Information Theory (cs.IT); Signal Processing (eess.SP)
  • Arxiv link: https://arxiv.org/abs/2311.09850
  • Pdf link: https://arxiv.org/pdf/2311.09850
  • Abstract Semantic communication has emerged as a promising technology to break the Shannon limit by extracting the meaning of source data and sending relevant semantic information only. However, some mobile devices may have limited computation and storage resources, which renders it difficult to deploy and implement the resource-demanding deep learning based semantic encoder/decoder. To tackle this challenge, we propose in this paper a new semantic relay (SemRelay), which is equipped with a semantic receiver for assisting text transmission from a resource-abundant base station (BS) to a resource-constrained mobile device. Specifically, the SemRelay first decodes the semantic information sent by the BS (with a semantic transmitter) and then forwards it to the user by adopting conventional bit transmission, hence effectively improving the text transmission efficiency. We formulate an optimization problem to maximize the achievable (effective) bit rate by jointly designing the SemRelay placement and bandwidth allocation. Although this problem is non-convex and generally difficult to solve, we propose an efficient penalty-based algorithm to obtain a high-quality suboptimal solution. Numerical results show the close-to-optimal performance of the proposed algorithm as well as significant rate performance gain of the proposed SemRelay over conventional decode-and-forward relay.

Short vs. Long-term Coordination of Drones: When Distributed Optimization Meets Deep Reinforcement Learning

  • Authors: Chuhao Qin, Evangelos Pournaras
  • Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
  • Arxiv link: https://arxiv.org/abs/2311.09852
  • Pdf link: https://arxiv.org/pdf/2311.09852
  • Abstract Swarms of smart drones, with the support of charging technology, can provide completing sensing capabilities in Smart Cities, such as traffic monitoring and disaster response. Existing approaches, including distributed optimization and deep reinforcement learning (DRL), aim to coordinate drones to achieve cost-effective, high-quality navigation, sensing, and recharging. However, they have distinct challenges: short-term optimization struggles to provide sustained benefits, while long-term DRL lacks scalability, resilience, and flexibility. To bridge this gap, this paper introduces a new progressive approach that encompasses the planning and selection based on distributed optimization, as well as DRL-based flying direction scheduling. Extensive experiments with datasets generated from realistic urban mobility demonstrate the outstanding performance of the proposed solution in traffic monitoring compared to three baseline methods.

Safety Aware Autonomous Path Planning Using Model Predictive Reinforcement Learning for Inland Waterways

  • Authors: Astrid Vanneste, Simon Vanneste, Olivier Vasseur, Robin Janssens, Mattias Billast, Ali Anwar, Kevin Mets, Tom De Schepper, Siegfried Mercelis, Peter Hellinckx
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.09878
  • Pdf link: https://arxiv.org/pdf/2311.09878
  • Abstract In recent years, interest in autonomous shipping in urban waterways has increased significantly due to the trend of keeping cars and trucks out of city centers. Classical approaches such as Frenet frame based planning and potential field navigation often require tuning of many configuration parameters and sometimes even require a different configuration depending on the situation. In this paper, we propose a novel path planning approach based on reinforcement learning called Model Predictive Reinforcement Learning (MPRL). MPRL calculates a series of waypoints for the vessel to follow. The environment is represented as an occupancy grid map, allowing us to deal with any shape of waterway and any number and shape of obstacles. We demonstrate our approach on two scenarios and compare the resulting path with path planning using a Frenet frame and path planning based on a proximal policy optimization (PPO) agent. Our results show that MPRL outperforms both baselines in both test scenarios. The PPO based approach was not able to reach the goal in either scenario while the Frenet frame approach failed in the scenario consisting of a corner with obstacles. MPRL was able to safely (collision free) navigate to the goal in both of the test scenarios.

Cross-Layer Optimization for Statistical QoS Provision in C-RAN with Finite-Length Coding

  • Authors: Chang Wu, Hancheng Lu, Yuang Chen, Langtian Qin
  • Subjects: Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
  • Arxiv link: https://arxiv.org/abs/2311.09879
  • Pdf link: https://arxiv.org/pdf/2311.09879
  • Abstract The cloud radio access network (C-RAN) has become the foundational structure for various emerging communication paradigms, leveraging the flexible deployment of distributed access points (APs) and centralized task processing. In this paper, we propose a cross-layer optimization framework based on a practical finite-length coding communication system in C-RAN, aiming at maximizing bandwidth efficiency while providing statistical quality of service (QoS) for individual services. Based on the theoretical results from effective capacity and finite-length coding, we formulate a joint optimization problem involving modulation and coding schemes (MCS), retransmission count, initial bandwidth allocation and AP selection, which reflects the coordinated decision of parameters across the physical layer, data link layer and transport layer. To tackle such a mixed-integer nonlinear programming (MINLP) problem, we firstly decompose it into a transmission parameter decision (TPD) sub-problem and a user association (UA) sub-problem, which can be solved by a binary search-based algorithm and an auction-based algorithm respectively. Simulation results demonstrate that the proposed model can accurately capture the impact of QoS requirements and channel quality on the optimal transmission parameters. Furthermore, compared with fixed transmission parameter setting, the proposed algorithms achieve the bandwidth efficiency gain up to 27.87% under various traffic and channel scenarios.

The Software Genome Project: Venture to the Genomic Pathways of Open Source Software and Its Applications

  • Authors: Yueming Wu, Chengwei Liu, Yang Liu
  • Subjects: Software Engineering (cs.SE); Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/2311.09881
  • Pdf link: https://arxiv.org/pdf/2311.09881
  • Abstract With the boom in modern software development, open-source software has become an integral part of various industries, driving progress in computer science. However, the immense complexity and diversity of the open-source ecosystem also pose a series of challenges, including issues of quality, security, management, maintenance, compliance, and sustainability. Existing open-source governance approaches, while excelling in community building and collaboration, still face shortcomings in decentralized management, security, and maintenance. To address these challenges, inspired by the Human Genome Project, we treat the software source code as software DNA and propose the Software Genome Project, which is geared towards the secure monitoring and exploitation of open-source software. By identifying and labeling integrated and classified code features at a fine-grained level, and effectively identifying safeguards for functional implementations and non-functional requirements at different levels of granularity, the Software Genome Project builds a complete set of software genome maps to help developers and managers gain a deeper understanding of software complexity and diversity. By dissecting and summarizing functional and undesirable genes, the Software Genome Project helps facilitate targeted software remediation and optimization, provides valuable insight and understanding of the entire software ecosystem, and supports critical development tasks such as technology selection and open source governance. This project is expected to drive the evolution of software development towards more efficient, reliable, and sustainable software solutions.

Dynamic modeling of an alkaline electrolyzer plant for process simulation and optimization

  • Authors: Nicola Cantisani, Josefine Dovits, John Bagterp Jørgensen
  • Subjects: Systems and Control (eess.SY)
  • Arxiv link: https://arxiv.org/abs/2311.09882
  • Pdf link: https://arxiv.org/pdf/2311.09882
  • Abstract We develop a mathematical model for dynamical simulation of an alkaline electrolyzer plant. We model each component of the system with mass and energy balances. Our modeling strategy consists of a rigorous and systematic formulation using differential algebraic equations (DAE), along with a thermodynamic library that evaluates thermophysical properties. We show steady state diagrams for the electrolyzer stack, and perform dynamic simulations. Dynamic modelling of an electrolyzer enables simulation and model-based optimization and control for optimal hydrogen production under varying operating conditions.

Scalable Sequential Optimization Under Observability Don't Cares

  • Authors: Dewmini Sudara Marakkalage, Eleonora Testa, Walter Lau Neto, Alan Mishchenko, Giovanni De Micheli, Luca Amarù
  • Subjects: Logic in Computer Science (cs.LO)
  • Arxiv link: https://arxiv.org/abs/2311.09967
  • Pdf link: https://arxiv.org/pdf/2311.09967
  • Abstract Sequential logic synthesis can provide better Power-Performance-Area (PPA) than combinational logic synthesis since it explores a larger solution space. As the gate cost in advanced technologies keeps rising, sequential logic synthesis provides a powerful alternative that is gaining momentum in the EDA community. In this work, we present a new scalable algorithm for don't-care-based sequential logic synthesis. Our new approach is based on sequential k-step induction and can apply both redundancy removal and resubstitution transformations under Sequential Observability Don't Cares (SODCs). Using SODC-based optimizations with induction is a challenging problem due to dependencies and alignment of don't cares among the base case and the inductive case. We propose a new approach utilizing the full power of SODCs without limiting the solution space. Our algorithm is implemented as part of an industrial tool and achieves 6.9% average area improvement after technology mapping when compared to state-of-the-art sequential synthesis methods. Moreover, all the new sequential optimizations can be verified using state-of-the-art sequential verification tools.

Xputer: Bridging Data Gaps with NMF, XGBoost, and a Streamlined GUI Experience

  • Authors: Saleena Younus, Lars Rönnstrand, Julhash U. Kazi
  • Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Methodology (stat.ME)
  • Arxiv link: https://arxiv.org/abs/2311.09989
  • Pdf link: https://arxiv.org/pdf/2311.09989
  • Abstract The rapid proliferation of data across diverse fields has accentuated the importance of accurate imputation for missing values. This task is crucial for ensuring data integrity and deriving meaningful insights. In response to this challenge, we present Xputer, a novel imputation tool that adeptly integrates Non-negative Matrix Factorization (NMF) with the predictive strengths of XGBoost. One of Xputer's standout features is its versatility: it supports zero imputation, enables hyperparameter optimization through Optuna, and allows users to define the number of iterations. For enhanced user experience and accessibility, we have equipped Xputer with an intuitive Graphical User Interface (GUI) ensuring ease of handling, even for those less familiar with computational tools. In performance benchmarks, Xputer not only rivals the computational speed of established tools such as IterativeImputer but also often outperforms them in terms of imputation accuracy. Furthermore, Xputer autonomously handles a diverse spectrum of data types, including categorical, continuous, and Boolean, eliminating the need for prior preprocessing. Given its blend of performance, flexibility, and user-friendly design, Xputer emerges as a state-of-the-art solution in the realm of data imputation.

Interpretable Reinforcement Learning for Robotics and Continuous Control

  • Authors: Rohan Paleja, Letian Chen, Yaru Niu, Andrew Silva, Zhaoxin Li, Songan Zhang, Chace Ritchie, Sugju Choi, Kimberlee Chestnut Chang, Hongtei Eric Tseng, Yan Wang, Subramanya Nageshrao, Matthew Gombolay
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2311.10041
  • Pdf link: https://arxiv.org/pdf/2311.10041
  • Abstract Interpretability in machine learning is critical for the safe deployment of learned policies across legally-regulated and safety-critical domains. While gradient-based approaches in reinforcement learning have achieved tremendous success in learning policies for continuous control problems such as robotics and autonomous driving, the lack of interpretability is a fundamental barrier to adoption. We propose Interpretable Continuous Control Trees (ICCTs), a tree-based model that can be optimized via modern, gradient-based, reinforcement learning approaches to produce high-performing, interpretable policies. The key to our approach is a procedure for allowing direct optimization in a sparse decision-tree-like representation. We validate ICCTs against baselines across six domains, showing that ICCTs are capable of learning policies that match or outperform baselines by up to 33% in autonomous driving scenarios while achieving a 300x-600x reduction in the number of parameters against deep learning baselines. We prove that ICCTs can serve as universal function approximators and display analytically that ICCTs can be verified in linear time. Furthermore, we deploy ICCTs in two realistic driving domains, based on interstate Highway-94 and 280 in the US. Finally, we verify ICCTs' utility with end-users and find that ICCTs are rated easier to simulate, quicker to validate, and more interpretable than neural networks.

A Computationally Efficient Sparsified Online Newton Method

  • Authors: Fnu Devvrit, Sai Surya Duvvuri, Rohan Anil, Vineet Gupta, Cho-Jui Hsieh, Inderjit Dhillon
  • Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Optimization and Control (math.OC)
  • Arxiv link: https://arxiv.org/abs/2311.10085
  • Pdf link: https://arxiv.org/pdf/2311.10085
  • Abstract Second-order methods hold significant promise for enhancing the convergence of deep neural network training; however, their large memory and computational demands have limited their practicality. Thus there is a need for scalable second-order methods that can efficiently train large models. In this paper, we introduce the Sparsified Online Newton (SONew) method, a memory-efficient second-order algorithm that yields a sparsified yet effective preconditioner. The algorithm emerges from a novel use of the LogDet matrix divergence measure; we combine it with sparsity constraints to minimize regret in the online convex optimization framework. Empirically, we test our method on large scale benchmarks of up to 1B parameters. We achieve up to 30% faster convergence, 3.4% relative improvement in validation performance, and 80% relative improvement in training loss, in comparison to memory efficient optimizers including first order methods. Powering the method is a surprising fact -- imposing structured sparsity patterns, like tridiagonal and banded structure, requires little to no overhead, making it as efficient and parallelizable as first-order methods. In wall-clock time, tridiagonal SONew is only about 3% slower per step than first-order methods but gives overall gains due to much faster convergence. In contrast, one of the state-of-the-art (SOTA) memory-intensive second-order methods, Shampoo, is unable to scale to large benchmarks. Additionally, while Shampoo necessitates significant engineering efforts to scale to large benchmarks, SONew offers a more straightforward implementation, increasing its practical appeal. SONew code is available at: https://github.com/devvrit/SONew

Adaptive Shells for Efficient Neural Radiance Field Rendering

  • Authors: Zian Wang, Tianchang Shen, Merlin Nimier-David, Nicholas Sharp, Jun Gao, Alexander Keller, Sanja Fidler, Thomas Müller, Zan Gojcic
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
  • Arxiv link: https://arxiv.org/abs/2311.10091
  • Pdf link: https://arxiv.org/pdf/2311.10091
  • Abstract Neural radiance fields achieve unprecedented quality for novel view synthesis, but their volumetric formulation remains expensive, requiring a huge number of samples to render high-resolution images. Volumetric encodings are essential to represent fuzzy geometry such as foliage and hair, and they are well-suited for stochastic optimization. Yet, many scenes ultimately consist largely of solid surfaces which can be accurately rendered by a single sample per pixel. Based on this insight, we propose a neural radiance formulation that smoothly transitions between volumetric- and surface-based rendering, greatly accelerating rendering speed and even improving visual fidelity. Our method constructs an explicit mesh envelope which spatially bounds a neural volumetric representation. In solid regions, the envelope nearly converges to a surface and can often be rendered with a single sample. To this end, we generalize the NeuS formulation with a learned spatially-varying kernel size which encodes the spread of the density, fitting a wide kernel to volume-like regions and a tight kernel to surface-like regions. We then extract an explicit mesh of a narrow band around the surface, with width determined by the kernel size, and fine-tune the radiance field within this band. At inference time, we cast rays against the mesh and evaluate the radiance field only within the enclosed region, greatly reducing the number of samples required. Experiments show that our approach enables efficient rendering at very high fidelity. We also demonstrate that the extracted envelope enables downstream applications such as animation and simulation.

Keyword: adam

There is no result

Keyword: gradient

Alternatives to the Scaled Dot Product for Attention in the Transformer Neural Network Architecture

  • Authors: James Bernhard
  • Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Neural and Evolutionary Computing (cs.NE)
  • Arxiv link: https://arxiv.org/abs/2311.09406
  • Pdf link: https://arxiv.org/pdf/2311.09406
  • Abstract The transformer neural network architecture uses a form of attention in which the dot product of query and key is divided by the square root of the key dimension before applying softmax. This scaling of the dot product is designed to avoid the absolute value of the dot products becoming so large that applying softmax leads to vanishing gradients. In this paper, we propose some alternative scalings, including dividing the dot product instead by the sum of the key lengths before applying softmax. We use simulated keys and queries to show that in many situations this appears to be more effective at avoiding regions where applying softmax leads to vanishing gradients.
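
As a rough illustration of the scaling question raised in this abstract, the sketch below compares softmax attention weights under the standard division by the square root of the key dimension with one possible reading of the proposed alternative. The interpretation of "sum of the key lengths" as the sum of the Euclidean norms of all keys is an assumption made here, not something the abstract confirms, and the function names and simulated data are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def attention_weights(query, keys, scaling="sqrt_dim"):
    """Attention weights of one query over a list of keys."""
    raw = [dot(query, k) for k in keys]
    if scaling == "sqrt_dim":
        # Standard transformer scaling: divide by sqrt(key dimension).
        denom = math.sqrt(len(query))
    elif scaling == "sum_key_norms":
        # ASSUMPTION: "sum of the key lengths" is read as the sum of the
        # Euclidean norms of the keys; the paper may define it differently.
        denom = sum(norm(k) for k in keys)
    else:
        raise ValueError(f"unknown scaling: {scaling}")
    return softmax([r / denom for r in raw])


# Simulated query and keys with large entries, so raw dot products are large.
rng = random.Random(0)
dim, n_keys = 64, 8
query = [rng.gauss(0, 3) for _ in range(dim)]
keys = [[rng.gauss(0, 3) for _ in range(dim)] for _ in range(n_keys)]

w_sqrt = attention_weights(query, keys, "sqrt_dim")
w_sum = attention_weights(query, keys, "sum_key_norms")
print("largest weight, sqrt(d) scaling:      ", max(w_sqrt))
print("largest weight, sum-of-norms scaling: ", max(w_sum))
```

When the largest weight approaches 1, the softmax output is nearly one-hot and its Jacobian entries, p_i(δ_ij − p_j), are close to zero, which is the vanishing-gradient regime the abstract refers to. A larger denominator flattens the weights; whether that helps training is the paper's empirical question, not something this sketch establishes.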

LymphoML: An interpretable artificial intelligence-based method identifies morphologic features that correlate with lymphoma subtype

  • Authors: Vivek Shankar, Xiaoli Yang, Vrishab Krishna, Brent Tan, Oscar Silva, Rebecca Rojansky, Andrew Ng, Fabiola Valvert, Edward Briercheck, David Weinstock, Yasodha Natkunam, Sebastian Fernandez-Pol, Pranav Rajpurkar
  • Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2311.09574
  • Pdf link: https://arxiv.org/pdf/2311.09574
  • Abstract The accurate classification of lymphoma subtypes using hematoxylin and eosin (H&E)-stained tissue is complicated by the wide range of morphological features these cancers can exhibit. We present LymphoML - an interpretable machine learning method that identifies morphologic features that correlate with lymphoma subtypes. Our method applies steps to process H&E-stained tissue microarray cores, segment nuclei and cells, compute features encompassing morphology, texture, and architecture, and train gradient-boosted models to make diagnostic predictions. LymphoML's interpretable models, developed on a limited volume of H&E-stained tissue, achieve non-inferior diagnostic accuracy to pathologists using whole-slide images and outperform black box deep-learning on a dataset of 670 cases from Guatemala spanning 8 lymphoma subtypes. Using SHapley Additive exPlanation (SHAP) analysis, we assess the impact of each feature on model prediction and find that nuclear shape features are most discriminative for DLBCL (F1-score: 78.7%) and classical Hodgkin lymphoma (F1-score: 74.5%). Finally, we provide the first demonstration that a model combining features from H&E-stained tissue with features from a standardized panel of 6 immunostains results in a similar diagnostic accuracy (85.3%) to a 46-stain panel (86.1%).

GAIA: Delving into Gradient-based Attribution Abnormality for Out-of-distribution Detection

  • Authors: Jinggang Chen, Junjie Li, Xiaoyang Qu, Jianzong Wang, Jiguang Wan, Jing Xiao
  • Subjects: Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.09620
  • Pdf link: https://arxiv.org/pdf/2311.09620
  • Abstract Detecting out-of-distribution (OOD) examples is crucial to guarantee the reliability and safety of deep neural networks in real-world settings. In this paper, we offer an innovative perspective on quantifying the disparities between in-distribution (ID) and OOD data -- analyzing the uncertainty that arises when models attempt to explain their predictive decisions. This perspective is motivated by our observation that gradient-based attribution methods encounter challenges in assigning feature importance to OOD data, thereby yielding divergent explanation patterns. Consequently, we investigate how attribution gradients lead to uncertain explanation outcomes and introduce two forms of abnormalities for OOD detection: the zero-deflation abnormality and the channel-wise average abnormality. We then propose GAIA, a simple and effective approach that incorporates Gradient Abnormality Inspection and Aggregation. The effectiveness of GAIA is validated on both commonly utilized (CIFAR) and large-scale (ImageNet-1k) benchmarks. Specifically, GAIA reduces the average FPR95 by 23.10% on CIFAR10 and by 45.41% on CIFAR100 compared to advanced post-hoc methods.

On the Quantification of Image Reconstruction Uncertainty without Training Data

  • Authors: Sirui Bi, Victor Fung, Jiaxin Zhang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2311.09639
  • Pdf link: https://arxiv.org/pdf/2311.09639
  • Abstract Computational imaging plays a pivotal role in determining hidden information from sparse measurements. A robust inverse solver is crucial to fully characterize the uncertainty induced by these measurements, as it allows for the estimation of the complete posterior of unrecoverable targets. This, in turn, facilitates a probabilistic interpretation of observational data for decision-making. In this study, we propose a deep variational framework that leverages a deep generative model to learn an approximate posterior distribution to effectively quantify image reconstruction uncertainty without the need for training data. We parameterize the target posterior using a flow-based model and minimize their Kullback-Leibler (KL) divergence to achieve accurate uncertainty estimation. To bolster stability, we introduce a robust flow-based model with bi-directional regularization and enhance expressivity through gradient boosting. Additionally, we incorporate a space-filling design to achieve substantial variance reduction on both latent prior space and target posterior space. We validate our method on several benchmark tasks and two real-world applications, namely fastMRI and black hole image reconstruction. Our results indicate that our method provides reliable and high-quality image reconstruction with robust uncertainty estimation.

Gradient-Map-Guided Adaptive Domain Generalization for Cross Modality MRI Segmentation

  • Authors: Bingnan Li, Zhitong Gao, Xuming He
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2311.09737
  • Pdf link: https://arxiv.org/pdf/2311.09737
  • Abstract Cross-modal MRI segmentation is of great value for computer-aided medical diagnosis, enabling flexible data acquisition and model generalization. However, most existing methods have difficulty in handling local variations in domain shift and typically require a significant amount of data for training, which hinders their usage in practice. To address these problems, we propose a novel adaptive domain generalization framework, which integrates a learning-free cross-domain representation based on image gradient maps and a class prior-informed test-time adaptation strategy for mitigating local domain shift. We validate our approach on two multi-modal MRI datasets with six cross-modal segmentation tasks. Across all the task settings, our method consistently outperforms competing approaches and shows a stable performance even with limited training data.

Hijacking Large Language Models via Adversarial In-Context Learning

  • Authors: Yao Qiang, Xiangyu Zhou, Dongxiao Zhu
  • Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
  • Arxiv link: https://arxiv.org/abs/2311.09948
  • Pdf link: https://arxiv.org/pdf/2311.09948
  • Abstract In-context learning (ICL) has emerged as a powerful paradigm leveraging LLMs for specific tasks by utilizing labeled examples as demonstrations in the precondition prompts. Despite its promising performance, ICL suffers from instability with the choice and arrangement of examples. Additionally, crafted adversarial attacks pose a notable threat to the robustness of ICL. However, existing attacks are either easy to detect, rely on external models, or lack specificity towards ICL. To address these issues, this work introduces a novel transferable attack for ICL, aiming to hijack LLMs to generate the targeted response. The proposed LLM hijacking attack leverages a gradient-based prompt search method to learn and append imperceptible adversarial suffixes to the in-context demonstrations. Extensive experimental results on various tasks and datasets demonstrate the effectiveness of our LLM hijacking attack, resulting in a distracted attention towards adversarial tokens, consequently leading to the targeted unwanted outputs.

DeepEMD: A Transformer-based Fast Estimation of the Earth Mover's Distance

  • Authors: Atul Kumar Sinha, Francois Fleuret
  • Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2311.09998
  • Pdf link: https://arxiv.org/pdf/2311.09998
  • Abstract The Earth Mover's Distance (EMD) is the measure of choice between point clouds. However, the computational cost to compute it makes it prohibitive as a training loss, and the standard approach is to use a surrogate such as the Chamfer distance. We propose an attention-based model to compute an accurate approximation of the EMD that can be used as a training loss for generative models. To get the necessary accurate estimation of the gradients, we train our model to explicitly compute the matching between point clouds instead of EMD itself. We cast this new objective as the estimation of an attention matrix that approximates the ground truth matching matrix. Experiments show that this model provides an accurate estimate of the EMD and its gradient with a wall clock speed-up of more than two orders of magnitude with respect to the exact Hungarian matching algorithm and one order of magnitude with respect to the standard approximate Sinkhorn algorithm, which makes it possible, in particular, to train a point cloud VAE with the EMD itself. Extensive evaluations show the remarkable behaviour of this model when operating out-of-distribution, a key requirement for a distance surrogate. Finally, the model generalizes very well to point clouds during inference several times larger than during training.
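
To make the two distances named in this abstract concrete, here is a minimal sketch for tiny, equal-sized point clouds with uniform weights. It finds the exact EMD by brute force over all one-to-one matchings, which is only feasible for a handful of points and is neither the Hungarian algorithm nor the paper's attention model. The function names and the choice to average per-point costs are assumptions made for illustration.

```python
import itertools
import math


def emd_bruteforce(a, b):
    """Exact EMD between equal-sized point clouds with uniform weights.

    Tries every one-to-one matching (O(n!) work), so it is only usable for
    very small n. Returns (mean matched distance, best matching), where
    best matching[i] is the index in b assigned to a[i].
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal-sized point clouds")
    n = len(a)
    best_cost, best_perm = math.inf, None
    for perm in itertools.permutations(range(n)):
        cost = sum(math.dist(a[i], b[j]) for i, j in enumerate(perm)) / n
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm


def chamfer(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance both ways."""
    a_to_b = sum(min(math.dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(math.dist(p, q) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


cloud_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
cloud_b = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]

emd_value, matching = emd_bruteforce(cloud_a, cloud_b)
print("EMD:", emd_value, "matching:", matching)   # EMD: 1.0 matching: (0, 1, 2)
print("Chamfer:", chamfer(cloud_a, cloud_b))      # Chamfer: 2.0
```

The matching returned alongside the cost is the kind of ground-truth assignment the abstract says the model is trained to predict; the Chamfer distance, by contrast, lets several points share one nearest neighbour and so involves no matching at all.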

Interpretable Reinforcement Learning for Robotics and Continuous Control

  • Authors: Rohan Paleja, Letian Chen, Yaru Niu, Andrew Silva, Zhaoxin Li, Songan Zhang, Chace Ritchie, Sugju Choi, Kimberlee Chestnut Chang, Hongtei Eric Tseng, Yan Wang, Subramanya Nageshrao, Matthew Gombolay
  • Subjects: Robotics (cs.RO)
  • Arxiv link: https://arxiv.org/abs/2311.10041
  • Pdf link: https://arxiv.org/pdf/2311.10041
  • Abstract Interpretability in machine learning is critical for the safe deployment of learned policies across legally-regulated and safety-critical domains. While gradient-based approaches in reinforcement learning have achieved tremendous success in learning policies for continuous control problems such as robotics and autonomous driving, the lack of interpretability is a fundamental barrier to adoption. We propose Interpretable Continuous Control Trees (ICCTs), a tree-based model that can be optimized via modern, gradient-based, reinforcement learning approaches to produce high-performing, interpretable policies. The key to our approach is a procedure for allowing direct optimization in a sparse decision-tree-like representation. We validate ICCTs against baselines across six domains, showing that ICCTs are capable of learning policies that match or outperform baselines by up to 33% in autonomous driving scenarios while achieving a 300x-600x reduction in the number of parameters against deep learning baselines. We prove that ICCTs can serve as universal function approximators and display analytically that ICCTs can be verified in linear time. Furthermore, we deploy ICCTs in two realistic driving domains, based on interstate Highway-94 and 280 in the US. Finally, we verify ICCTs' utility with end-users and find that ICCTs are rated easier to simulate, quicker to validate, and more interpretable than neural networks.

Keyword: super-resolution

DIFFNAT: Improving Diffusion Image Quality Using Natural Image Statistics

  • Authors: Aniket Roy, Maiterya Suin, Anshul Shah, Ketul Shah, Jiang Liu, Rama Chellappa
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2311.09753
  • Pdf link: https://arxiv.org/pdf/2311.09753
  • Abstract Diffusion models have advanced generative AI significantly in terms of editing and creating naturalistic images. However, efficiently improving generated image quality is still of paramount interest. In this context, we propose a generic "naturalness" preserving loss function, viz., kurtosis concentration (KC) loss, which can be readily applied to any standard diffusion model pipeline to elevate the image quality. Our motivation stems from the projected kurtosis concentration property of natural images, which states that natural images have nearly constant kurtosis values across different band-pass versions of the image. To retain the "naturalness" of the generated images, we enforce reducing the gap between the highest and lowest kurtosis values across the band-pass versions (e.g., Discrete Wavelet Transform (DWT)) of images. Note that our approach does not require any additional guidance like classifier or classifier-free guidance to improve the image quality. We validate the proposed approach for three diverse tasks, viz., (1) personalized few-shot finetuning using text guidance, (2) unconditional image generation, and (3) image super-resolution. Integrating the proposed KC loss has improved the perceptual quality across all these tasks in terms of FID, MUSIQ score, and user evaluation.
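
The gap-between-kurtosis-values idea in this abstract can be sketched in a few lines. The version below is a simplification under stated assumptions: it works on a 1D signal rather than an image, uses the detail coefficients of a hand-written multi-level Haar transform as the band-pass versions, and computes plain (non-excess) kurtosis. All function names are hypothetical, and the paper's actual loss may differ in each of these choices.

```python
import math
import random


def haar_step(signal):
    """One level of an orthonormal Haar transform on an even-length list."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal) - 1, 2)]
    return approx, detail


def kurtosis(values, eps=1e-12):
    """Fourth central moment divided by squared variance (non-excess kurtosis)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    fourth = sum((v - mean) ** 4 for v in values) / n
    return fourth / (var ** 2 + eps)  # eps guards against constant bands


def kc_loss(signal, levels=3):
    """Gap between the highest and lowest kurtosis across Haar detail bands.

    ASSUMPTION: detail coefficients at each level stand in for the
    "band-pass versions" mentioned in the abstract.
    """
    kurts = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        kurts.append(kurtosis(detail))
    return max(kurts) - min(kurts)


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(256)]
print("KC gap for Gaussian noise:", kc_loss(noise))
```

In a diffusion training loop, a term like this would be added to the usual objective and computed with differentiable tensor operations; the pure-Python version here only shows what quantity is being penalised.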

Scene Text Image Super-resolution based on Text-conditional Diffusion Models

  • Authors: Chihiro Noguchi, Shun Fukuda, Masao Yamanaka
  • Subjects: Computer Vision and Pattern Recognition (cs.CV)
  • Arxiv link: https://arxiv.org/abs/2311.09759
  • Pdf link: https://arxiv.org/pdf/2311.09759
  • Abstract Scene Text Image Super-resolution (STISR) has recently achieved great success as a preprocessing method for scene text recognition. STISR aims to transform blurred and noisy low-resolution (LR) text images in real-world settings into clear high-resolution (HR) text images suitable for scene text recognition. In this study, we leverage text-conditional diffusion models (DMs), known for their impressive text-to-image synthesis capabilities, for STISR tasks. Our experimental results revealed that text-conditional DMs notably surpass existing STISR methods. Especially when texts from LR text images are given as input, the text-conditional DMs are able to produce superior quality super-resolution text images. Utilizing this capability, we propose a novel framework for synthesizing LR-HR paired text image datasets. This framework consists of three specialized text-conditional DMs, each dedicated to text image synthesis, super-resolution, and image degradation. These three modules are vital for synthesizing distinct LR and HR paired images, which are more suitable for training STISR methods. Our experiments confirmed that these synthesized image pairs significantly enhance the performance of STISR methods in the TextZoom evaluation.

DSR-Diff: Depth Map Super-Resolution with Diffusion Model

  • Authors: Yuan Shi, Bin Xia, Rui Zhu, Qingmin Liao, Wenming Yang
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
  • Arxiv link: https://arxiv.org/abs/2311.09919
  • Pdf link: https://arxiv.org/pdf/2311.09919
  • Abstract Color-guided depth map super-resolution (CDSR) improves the spatial resolution of a low-quality depth map with the corresponding high-quality color map, benefiting various applications such as 3D reconstruction, virtual reality, and augmented reality. While conventional CDSR methods typically rely on convolutional neural networks or transformers, diffusion models (DMs) have demonstrated notable effectiveness in high-level vision tasks. In this work, we present a novel CDSR paradigm that utilizes a diffusion model within the latent space to generate guidance for depth map super-resolution. The proposed method comprises a guidance generation network (GGN), a depth map super-resolution network (DSRN), and a guidance recovery network (GRN). The GGN is specifically designed to generate the guidance while managing its compactness. Additionally, we integrate a simple but effective feature fusion module and a transformer-style feature extraction module into the DSRN, enabling it to leverage guided priors in the extraction, fusion, and reconstruction of multi-model images. Taking into account both accuracy and efficiency, our proposed method has shown superior performance in extensive experiments when compared to state-of-the-art methods. Our codes will be made available at https://github.com/shiyuan7/DSR-Diff.

Emu Edit: Precise Image Editing via Recognition and Generation Tasks

  • Authors: Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, Yaniv Taigman
  • Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
  • Arxiv link: https://arxiv.org/abs/2311.10089
  • Pdf link: https://arxiv.org/pdf/2311.10089
  • Abstract Instruction-based image editing holds immense potential for a variety of applications, as it enables users to perform any editing operation using a natural language instruction. However, current models in this domain often struggle with accurately executing user instructions. We present Emu Edit, a multi-task image editing model which sets state-of-the-art results in instruction-based image editing. To develop Emu Edit we train it to multi-task across an unprecedented range of tasks, such as region-based editing, free-form editing, and Computer Vision tasks, all of which are formulated as generative tasks. Additionally, to enhance Emu Edit's multi-task learning abilities, we provide it with learned task embeddings which guide the generation process towards the correct edit type. Both these elements are essential for Emu Edit's outstanding performance. Furthermore, we show that Emu Edit can generalize to new tasks, such as image inpainting, super-resolution, and compositions of editing tasks, with just a few labeled examples. This capability offers a significant advantage in scenarios where high-quality samples are scarce. Lastly, to facilitate a more rigorous and informed assessment of instructable image editing models, we release a new challenging and versatile benchmark that includes seven different image editing tasks.

zoq, Nov 17 '23 07:11