
Community FYI: Maintained Fork with Major Improvements (OpenLimit v2 Alpha)

Open Elijas opened this issue 9 months ago • 4 comments

I have forked the repo and implemented an improved version, currently considering how to best release it

✅ Already implemented in v2:

  • ✅ Consider implementing a mechanism to "refund" unused response tokens. In a worst-case scenario, with max_output_tokens=120_000, ten requests could prematurely consume the entire 1M token quota—even if the actual responses were much shorter. This leads to significant throughput inefficiency and undermines the core purpose of the library if it's not properly accounted for.

    • https://github.com/shobrook/openlimit/issues/6#issuecomment-1648750433
  • ✅ Add support to limit Request TPM and Response TPM separately (Anthropic API)

  • ✅ Add support to set per-model token counters (for non-OpenAI models)

  • ✅ More convenient per-model management

    • https://github.com/shobrook/openlimit/compare/master...blackbirdai-team:openlimit:master
  • ✅ The old approach assumed the entire cost (tokens and request count) is incurred instantaneously at the start of the operation. This is fixed by the refund mechanism, which correctly (conservatively) sets the usage timestamp to the time the last token was generated (which is why you should call "refund()" even when nothing is refunded)

  • ✅ Time limiting across multiple time horizons (e.g. limit resource consumption both per minute and per hour, of the same or of different resources)

  • ✅ More intuitive/simple/streamlined API

    • ✅ Use 60 seconds per bucket as default (currently it's =1)
      • https://github.com/shobrook/openlimit/issues/8
  • ✅ Make token or request limiting optional (e.g. if the user wants to track only RPM)

    • https://github.com/shobrook/openlimit/compare/master...williamxhero:openlimitHW:master
  • ✅ Fix any leftover bugs (audit and fix the source code for race conditions, if any)

    • https://github.com/shobrook/openlimit/issues/9
  • ✅ Raise Error if a given input exceeds the available bucket size (I think this is already implemented, would have to check)

    • https://github.com/shobrook/openlimit/issues/12#issuecomment-1815654446
  • ✅ Add type hints

    • https://github.com/shobrook/openlimit/compare/master...oyarsa:openlimit:master
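The refund mechanism above can be sketched as a token bucket that reserves pessimistically up front and returns the unused portion once actual usage is known. Everything below (class and method names, numbers) is illustrative, not the fork's actual API:

```python
import time


class TokenBucket:
    """Minimal token-bucket sketch with refunds (hypothetical, for illustration).

    Capacity refills continuously at limit/per_seconds units per second;
    reserve() takes a pessimistic amount, refund() returns the unused part.
    """

    def __init__(self, limit: float, per_seconds: float):
        self.limit = limit
        self.per_seconds = per_seconds
        self.capacity = limit
        self.last = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        rate = self.limit / self.per_seconds
        self.capacity = min(self.limit, self.capacity + (now - self.last) * rate)
        self.last = now

    def reserve(self, amount: float) -> bool:
        self._refill()
        if amount > self.capacity:
            return False
        self.capacity -= amount
        return True

    def refund(self, reserved: float, actual: float) -> None:
        # Return only the unused portion, never exceeding the bucket limit.
        self.capacity = min(self.limit, self.capacity + max(0.0, reserved - actual))


bucket = TokenBucket(limit=1_000_000, per_seconds=60)
assert bucket.reserve(120_000)                  # pessimistic max_output_tokens
bucket.refund(reserved=120_000, actual=3_500)   # actual response was much shorter
```

Without the refund, ten reservations of 120k tokens would exhaust the 1M bucket even though the real usage was tiny.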

Potential improvements

  • Integrate model_prices_and_costs.json to rate-limit per cost (e.g. no more than $1 per minute, on top of limiting TPM and RPM)
  • Verify that the API is thread-safe (it should already be)
  • Use modern Python dev stack (uv, ruff, etc.)
    • https://github.com/shobrook/openlimit/compare/master...oyarsa:openlimit:master
  • Set up automatic CI/CD deployment to PyPI (currently, the PyPI release is out of sync)
    • https://github.com/shobrook/openlimit/issues/13#issuecomment-2322764992
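Per-cost limiting could be sketched by mapping token usage to dollars and treating cost as just another limited metric. The price table and function below are hypothetical (real prices would come from model_prices_and_costs.json):

```python
# Hypothetical price table, USD per 1K tokens; illustrative values only.
PRICES_PER_1K_TOKENS = {
    "gpt-4o": {"input": 0.005, "output": 0.015},
    "gpt-4o-mini": {"input": 0.0005, "output": 0.0015},
}


def usage_to_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Convert token usage into a dollar amount for the given model."""
    p = PRICES_PER_1K_TOKENS[model]
    return input_tokens / 1000 * p["input"] + output_tokens / 1000 * p["output"]


# The dollar amount could then be limited like any other metric, e.g.:
#   Quota(metric="usd", limit=1.0, per_seconds=60)   # at most $1 per minute
cost = usage_to_cost("gpt-4o", input_tokens=2000, output_tokens=500)
```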

Consider if there's demand for it

  • Implement Redis non-async (sync) backend
    • https://github.com/shobrook/openlimit/compare/master...Diamondy4:openlimit:master
  • Make Redis an optional dependency (pip install openlimit[redis])
    • https://github.com/shobrook/openlimit/compare/master...dragoneyeAI:openlimit-lite:master
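For reference, making Redis an extra is a small packaging change; a hypothetical pyproject.toml fragment:

```toml
# Hypothetical pyproject.toml fragment: Redis becomes opt-in via
#   pip install openlimit[redis]
[project.optional-dependencies]
redis = ["redis>=4.0"]
```

The package would then import redis lazily and raise a helpful error when the Redis backend is requested without the extra installed.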

cc @shobrook

Elijas avatar Mar 30 '25 12:03 Elijas

I suspect there's no wider demand for this functionality, but if anyone would like to use it, just drop a message below and I'll release the preview as open source

Elijas avatar Mar 30 '25 20:03 Elijas

Would be great if you opened a PR - It would be pretty helpful to have this in openlimit

themichaelusa avatar Apr 06 '25 20:04 themichaelusa

PR might not be the best idea because:

  1. openlimit is no longer maintained
  2. Some of the changes are not backwards compatible (though they could easily be made compatible with an adapter, if needed)
  3. Some of the feature set is temporarily reduced (only async methods and only the Redis backend are available at the moment)
  4. Test coverage with automated tests is not yet sufficient, so treat this more as an early preview

More demos: https://gist.github.com/justinvanwinkle/d9f04950083c4554835c1a35f9d22dad?permalink_comment_id=5527679#gistcomment-5527679

Would you still be interested in using it? I could release it in a separate repo (didn't do it yet, cause it's extra work for no benefit if I'm the only user :) )

Features

  • ✅ Atomic operations with semaphores to avoid race conditions
  • ✅ Limit multiple resources at once, such as both requests and LLM tokens, in a race-condition-safe way
  • ✅ Limit the same resource across multiple time windows, e.g. requests per minute and requests per day
  • ✅ Refund/correct actual usage, such as refunding unused tokens
  • ✅ Limit across multiple providers: the same limiter can hold different quotas for "gpt-4o" and "gpt-4o-mini", or quotas can be set dynamically (the quota is calculated by a callback from the model passed in, so it doesn't have to be provided in advance)

Demo

import asyncio
import time

# NOTE: Quota, create_limiter, `redis` (a backend instance), and
# loguru_callbacks come from the fork; exact import paths may differ.

quotas = [
    Quota(metric="oranges", limit=1, per_seconds=1),
    Quota(metric="bananas", limit=4, per_seconds=8),
    Quota(metric="bananas", limit=4, per_seconds=999999),
]

limiter = create_limiter(quotas, backend=redis, callbacks=loguru_callbacks)


async def main():
    total = 0
    start_time = time.time()

    for _ in range(10):
        # Reserve pessimistically before the request...
        usage_per_request = {
            "oranges": 1,
            "bananas": 2,
        }
        reservation = await limiter.acquire_capacity(
            usage=usage_per_request, model="gpt-123"
        )

        total += 1

        # ...then refund what wasn't actually used.
        actual_usage = {
            "oranges": 1,
            "bananas": 1,
        }
        await limiter.refund_capacity(actual_usage, reservation)

        elapsed_time = time.time() - start_time
        print(f"{total=}, elapsed_time={elapsed_time:.4f}")


asyncio.run(main())
ℹ️ 19:59:11.066 Rate limiter missing consumption data, assuming quota was previously unused model_family=gpt-123 usage_metric=bananas per_seconds=8  
ℹ️ 19:59:11.066 Rate limiter missing consumption data, assuming quota was previously unused model_family=gpt-123 usage_metric=bananas per_seconds=999999
ℹ️ 19:59:11.066 Rate limiter missing consumption data, assuming quota was previously unused model_family=gpt-123 usage_metric=oranges per_seconds=1  
🐞 19:59:11.067 Rate limiter capacity consumed model_family=gpt-123 usage=oranges=1.0 bananas=2.0 preconsumption_capacities=('bananas', 8)=4.0 ('bananas', 999999)=4.0 ('oranges', 1)=1.0 postconsumption_capacities=('bananas', 8)=2.0 ('bananas', 999999)=2.0 ('oranges', 1)=0.0 current_time=1743785951.063672 
🐞 19:59:11.071 Rate limiter capacity refunded model_family=gpt-123 reserved_usage=oranges=1.0 bananas=2.0 actual_usage=oranges=1.0 bananas=1.0 refunded_usage=oranges=0.0 bananas=1.0 prerefund_capacities=('bananas', 8)=2.0029149055480957 ('bananas', 999999)=2.000000023319268 ('oranges', 1)=0.005829811096191406 postrefund_capacities=('bananas', 8)=3.0029149055480957 ('bananas', 999999)=3.000000023319268 ('oranges', 1)=0.005829811096191406 
total=1, elapsed_time=0.0149
🐞 19:59:11.074 Rate limiter wait starting model_family=gpt-123 usage=oranges=1.0 bananas=2.0 preconsumption_capacities=('bananas', 8)=3.0046974420547485 ('bananas', 999999)=3.000000037579574 ('oranges', 1)=0.00939488410949707  
🐞 19:59:12.146 Rate limiter capacity consumed model_family=gpt-123 usage=oranges=1.0 bananas=2.0 preconsumption_capacities=('bananas', 8)=3.54066002368927 ('bananas', 999999)=3.000004325284515 ('oranges', 1)=1.0 postconsumption_capacities=('bananas', 8)=1.54066002368927 ('bananas', 999999)=1.000004325284515 ('oranges', 1)=0.0 current_time=1743785952.144992 
🐞 19:59:12.147 Rate limiter wait complete model_family=gpt-123 usage=oranges=1.0 bananas=2.0 preconsumption_capacities=('bananas', 8)=3.54066002368927 ('bananas', 999999)=3.000004325284515 ('oranges', 1)=1.0 postconsumption_capacities=('bananas', 8)=1.54066002368927 ('bananas', 999999)=1.000004325284515 ('oranges', 1)=0.0 wait_time_s=1.0750501155853271 
🐞 19:59:12.150 Rate limiter capacity refunded model_family=gpt-123 reserved_usage=oranges=1.0 bananas=2.0 actual_usage=oranges=1.0 bananas=1.0 refunded_usage=oranges=0.0 bananas=1.0 prerefund_capacities=('bananas', 8)=1.542686939239502 ('bananas', 999999)=1.0000043414998556 ('oranges', 1)=0.004053831100463867 postrefund_capacities=('bananas', 8)=2.542686939239502 ('bananas', 999999)=2.0000043414998556 ('oranges', 1)=0.004053831100463867
total=2, elapsed_time=1.0941
🐞 19:59:12.154 Rate limiter wait starting model_family=gpt-123 usage=oranges=1.0 bananas=2.0 preconsumption_capacities=('bananas', 8)=2.544639468193054 ('bananas', 999999)=2.000004357120103 ('oranges', 1)=0.00795888900756836
🐞 19:59:13.229 Rate limiter capacity consumed model_family=gpt-123 usage=oranges=1.0 bananas=2.0 preconsumption_capacities=('bananas', 8)=3.081404447555542 ('bananas', 999999)=2.000008651244232 ('oranges', 1)=1.0 postconsumption_capacities=('bananas', 8)=1.081404447555542 ('bananas', 999999)=8.65124423210517e-06 ('oranges', 1)=0.0 current_time=1743785953.226481
🐞 19:59:13.230 Rate limiter wait complete model_family=gpt-123 usage=oranges=1.0 bananas=2.0 preconsumption_capacities=('bananas', 8)=3.081404447555542 ('bananas', 999999)=2.000008651244232 ('oranges', 1)=1.0 postconsumption_capacities=('bananas', 8)=1.081404447555542 ('bananas', 999999)=8.65124423210517e-06 ('oranges', 1)=0.0 wait_time_s=1.0791480541229248 
🐞 19:59:13.235 Rate limiter capacity refunded model_family=gpt-123 reserved_usage=oranges=1.0 bananas=2.0 actual_usage=oranges=1.0 bananas=1.0 refunded_usage=oranges=0.0 bananas=1.0 prerefund_capacities=('bananas', 8)=1.0846229791641235 ('bananas', 999999)=8.676992510722102e-06 ('oranges', 1)=0.006437063217163086 postrefund_capacities=('bananas', 8)=2.0846229791641235 ('bananas', 999999)=1.0000086769925107 ('oranges', 1)=0.006437063217163086  
total=3, elapsed_time=2.1791
🐞 19:59:13.240 Rate limiter wait starting model_family=gpt-123 usage=oranges=1.0 bananas=2.0 preconsumption_capacities=('bananas', 8)=2.0870444774627686 ('bananas', 999999)=1.0000086963645165 ('oranges', 1)=0.011280059814453125 

Note

You can avoid the Redis backend and switch to another implementation by passing your own backend:

from abc import ABC, abstractmethod


class RateLimiterBackend(ABC):
    @abstractmethod
    async def await_for_capacity(self, usage: dict[str, float]) -> None: ...

    @abstractmethod
    async def refund_capacity(
        self,
        reserved_usage: dict[str, float],
        actual_usage: dict[str, float],
    ) -> None: ...
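For illustration, a minimal single-process backend implementing this interface might look like the sketch below. The interface is repeated for self-containedness; InMemoryBackend is hypothetical, not the library's implementation, and a real backend would also replenish capacity over time the way the Redis one does:

```python
import asyncio
from abc import ABC, abstractmethod


class RateLimiterBackend(ABC):
    @abstractmethod
    async def await_for_capacity(self, usage: dict[str, float]) -> None: ...

    @abstractmethod
    async def refund_capacity(
        self,
        reserved_usage: dict[str, float],
        actual_usage: dict[str, float],
    ) -> None: ...


class InMemoryBackend(RateLimiterBackend):
    """Illustrative single-process backend: fixed capacities, no refill."""

    def __init__(self, capacities: dict[str, float]):
        self._capacity = dict(capacities)
        self._condition = asyncio.Condition()

    async def await_for_capacity(self, usage: dict[str, float]) -> None:
        async with self._condition:
            # Wait until every requested metric has room, then consume
            # all of them atomically (avoids partial reservations).
            await self._condition.wait_for(
                lambda: all(self._capacity[m] >= v for m, v in usage.items())
            )
            for m, v in usage.items():
                self._capacity[m] -= v

    async def refund_capacity(
        self,
        reserved_usage: dict[str, float],
        actual_usage: dict[str, float],
    ) -> None:
        async with self._condition:
            # Return the unused portion and wake any waiting tasks.
            for m, reserved in reserved_usage.items():
                self._capacity[m] += max(0.0, reserved - actual_usage.get(m, 0.0))
            self._condition.notify_all()


async def demo():
    backend = InMemoryBackend({"oranges": 2.0, "bananas": 4.0})
    await backend.await_for_capacity({"oranges": 1.0, "bananas": 2.0})
    await backend.refund_capacity(
        {"oranges": 1.0, "bananas": 2.0},   # reserved
        {"oranges": 1.0, "bananas": 1.0},   # actually used
    )
    assert backend._capacity == {"oranges": 1.0, "bananas": 3.0}


asyncio.run(demo())
```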

Elijas avatar Apr 07 '25 09:04 Elijas

@themichaelusa I'm currently busy with other stuff, but in the meantime I can at least share the source code as it is for now. I've released it in a GitHub repo and as a PyPI package: pip install multi-resource-limiter

https://github.com/Elijas/multi-resource-limiter

let me know if it works for you, I'll do my best to help if it doesn't

Elijas avatar Apr 21 '25 09:04 Elijas