
[Bug]: PostgreSQL document upsert fails with asyncpg.exceptions.CharacterNotInRepertoireError: invalid byte sequence for encoding "UTF8": 0x00

Open YanSte opened this issue 10 months ago • 0 comments

Do you need to file an issue?

  • [x] I have searched the existing issues and this bug is not already filed.
  • [x] I believe this is a legitimate bug, not just a question or feature request.

Describe the bug

  • Upload a document via the LightRAG server with PostgreSQL as the storage backend.
  • The insert fails with asyncpg.exceptions.CharacterNotInRepertoireError: invalid byte sequence for encoding "UTF8": 0x00 (a minimal reproduction sketch follows this list).
  • The document stays in pending status and fails on every retry.
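
For reference, PostgreSQL cannot store the NUL byte (0x00) in text values, so the failure happens as soon as asyncpg binds the extracted content as a query parameter, independent of the LightRAG table or query. Below is a minimal reproduction sketch outside of LightRAG; the connection string is a placeholder, not from this setup:

```python
import asyncio

import asyncpg


async def main() -> None:
    # Placeholder DSN: point this at any scratch PostgreSQL instance.
    conn = await asyncpg.connect("postgresql://postgres:postgres@localhost:5432/postgres")
    try:
        # Extracted PDF text occasionally carries embedded NUL bytes (0x00).
        bad_content = "Hybrid LLM: cost-efficient\x00and quality-aware query routing"
        # Binding a NUL-containing string to a text parameter is rejected by
        # PostgreSQL itself, before any table is touched.
        await conn.execute("SELECT $1::text", bad_content)
    except asyncpg.exceptions.CharacterNotInRepertoireError as exc:
        print(f"reproduced: {exc}")  # invalid byte sequence for encoding "UTF8": 0x00
    finally:
        await conn.close()


asyncio.run(main())
```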

Steps to reproduce

  • Upload the document https://arxiv.org/html/2407.05750v1 via the LightRAG server.

Expected Behavior

  • The document is uploaded and processed successfully.

LightRAG Config Used

  • LightRAG master branch

Logs and screenshots

2025-02-19 22:18:54 ERROR:PostgreSQL database,
2025-02-19 22:18:54 sql:insert into LIGHTRAG_DOC_STATUS(workspace,id,content,content_summary,content_length,chunks_count,status)
2025-02-19 22:18:54                  values($1,$2,$3,$4,$5,$6,$7)
2025-02-19 22:18:54                   on conflict(id,workspace) do update set
2025-02-19 22:18:54                   content = EXCLUDED.content,
2025-02-19 22:18:54                   content_summary = EXCLUDED.content_summary,
2025-02-19 22:18:54                   content_length = EXCLUDED.content_length,
2025-02-19 22:18:54                   chunks_count = EXCLUDED.chunks_count,
2025-02-19 22:18:54                   status = EXCLUDED.status,
2025-02-19 22:18:54                   updated_at = CURRENT_TIMESTAMP,
2025-02-19 22:18:54 data:{'workspace': 'default', 'id': 'doc-8774e5c30ba9c1e89d4e739015e07afc', 'content': 'Published as a conference paper at ICLR 2024\nHYBRID LLM: C OST-EFFICIENT AND QUALITY -\nAWARE QUERY ROUTING\n... [extracted text of the uploaded paper truncated; 58,372 characters] ...', 'content_summary': 'Published as a conference paper at ICLR 2024\nHYBRID LLM: C OST-EFFICIENT AND QUALITY -\nAWARE QUERY R...', 'content_length': 58372, 'chunks_count': -1, 'status': <DocStatus.PENDING: 'pending'>},
2025-02-19 22:18:54 error:invalid byte sequence for encoding "UTF8": 0x00
2025-02-19 22:18:54 ERROR:Error processing or enqueueing file 7318_Hybrid_LLM_Cost_Efficient.pdf: invalid byte sequence for encoding "UTF8": 0x00
2025-02-19 22:18:54 ERROR:Traceback (most recent call last):
2025-02-19 22:18:54   File "/usr/local/lib/python3.11/site-packages/lightrag/api/lightrag_server.py", line 1213, in pipeline_enqueue_file
2025-02-19 22:18:54     await rag.apipeline_enqueue_documents(content)
2025-02-19 22:18:54   File "/usr/local/lib/python3.11/site-packages/lightrag/lightrag.py", line 764, in apipeline_enqueue_documents
2025-02-19 22:18:54     await self.doc_status.upsert(new_docs)
2025-02-19 22:18:54   File "/usr/local/lib/python3.11/site-packages/lightrag/kg/postgres_impl.py", line 632, in upsert
2025-02-19 22:18:54     await self.db.execute(
2025-02-19 22:18:54   File "/usr/local/lib/python3.11/site-packages/lightrag/kg/postgres_impl.py", line 184, in execute
2025-02-19 22:18:54     await connection.execute(sql, *data.values())  # type: ignore
2025-02-19 22:18:54     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-19 22:18:54   File "/usr/local/lib/python3.11/site-packages/asyncpg/connection.py", line 352, in execute
2025-02-19 22:18:54     _, status, _ = await self._execute(
2025-02-19 22:18:54                    ^^^^^^^^^^^^^^^^^^^^
2025-02-19 22:18:54   File "/usr/local/lib/python3.11/site-packages/asyncpg/connection.py", line 1864, in _execute
2025-02-19 22:18:54     result, _ = await self.__execute(
2025-02-19 22:18:54                 ^^^^^^^^^^^^^^^^^^^^^
2025-02-19 22:18:54   File "/usr/local/lib/python3.11/site-packages/asyncpg/connection.py", line 1961, in __execute
2025-02-19 22:18:54     result, stmt = await self._do_execute(
2025-02-19 22:18:54                    ^^^^^^^^^^^^^^^^^^^^^^^
2025-02-19 22:18:54   File "/usr/local/lib/python3.11/site-packages/asyncpg/connection.py", line 2024, in _do_execute
2025-02-19 22:18:54     result = await executor(stmt, None)
2025-02-19 22:18:54              ^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-19 22:18:54   File "asyncpg/protocol/protocol.pyx", line 206, in bind_execute
2025-02-19 22:18:54 asyncpg.exceptions.CharacterNotInRepertoireError: invalid byte sequence for encoding "UTF8": 0x00
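
A possible workaround until the storage layer sanitizes its input is to strip NUL bytes from the extracted text before it reaches apipeline_enqueue_documents (the call that fails in the traceback above). A minimal sketch; the helper name is illustrative and not part of LightRAG:

```python
def strip_nul_bytes(text: str) -> str:
    """Remove 0x00 characters, which PostgreSQL text columns reject."""
    return text.replace("\x00", "")


# Applied in the ingestion path before the failing call, e.g.:
#   content = strip_nul_bytes(extracted_pdf_text)
#   await rag.apipeline_enqueue_documents(content)
```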

Additional Information

  • LightRAG Version: Master
  • Operating System: Docker
  • Python Version: 3.11 (per the site-packages paths in the traceback)
  • Related Issues: NA

YanSte · Feb 19 '25 21:02