
optimize Lua pool size

Status: Open · szuecs opened this issue 3 years ago • 8 comments

Sets the pool size depending on available memory and keeps a maximum PoolSize of N (TBD).
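A minimal sketch of the sizing idea, assuming a fixed per-state memory cost; poolSizeFromMemory, bytesPerLuaState, and the budget heuristic are invented for illustration and are not necessarily what this PR does:

package main

import (
	"fmt"
	"runtime"
)

const (
	maxPoolSize      = 1000    // the max PoolSize "N" mentioned above
	bytesPerLuaState = 1 << 20 // assumed average footprint of one lua.LState
)

// poolSizeFromMemory derives a pool size from a memory budget,
// clamped to [1, maxPoolSize].
func poolSizeFromMemory(memoryBudget uint64) int {
	n := int(memoryBudget / bytesPerLuaState)
	if n > maxPoolSize {
		n = maxPoolSize
	}
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// Spend at most half of the memory the Go runtime has obtained so far.
	fmt.Println("pool size:", poolSizeFromMemory(m.Sys/2))
}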

PoolSize 1000

Tested with 2 routes (minimal Lua):

% cat examples/script_mini.eskip
r1: Path("/foo") -> disableAccessLog(2) -> inlineContent("OK") -> <shunt>;
r0: * -> disableAccessLog(2) -> status(200) -> lua(`
function request(c)
    c.request.url_path = "/bar"
    -- print(c.request.url_path)
end
function response(c)
    c.response.header["x-foo"]= "hello"
end
`)
-> <shunt>;

Test command:

for rate in 250 500 750 1000 1500 2000 2500 3000
do
  echo "## $rate test with Lua"
  echo "GET http://127.0.0.1:9999/" | vegeta attack -duration=10s -rate=$rate/1s | vegeta report 
done
for rate in 250 500 750 1000 1500 2000 2500 3000
do 
  echo "## $rate test without Lua"
  echo "GET http://127.0.0.1:9999/foo" | vegeta attack -duration=10s -rate=$rate/1s | vegeta report
done

Results (report outputs not included in this copy):
- route without Lua
- route with Lua
- baseline with Lua before change

Tested with 2 routes (Lua computation):

r1: Path("/foo") -> disableAccessLog(2) -> inlineContent("OK") -> <shunt>;
r0: * -> disableAccessLog(2) -> status(200) -> lua(`
function request(c)
    local charset = "abcdefghijklmnopqrstuvwxyz"
    local s = ''
    for i = 1, 5 do
        local r = math.random(1, #charset)
        s = s .. charset:sub(r, r)
    end
    c.request.url_path = s
    -- print(c.request.url_path)
end
function response(c)
    local charset = "abcdefghijklmnopqrstuvwxyz"
    local s = ''
    for i = 1, 5 do
        local r = math.random(1, #charset)
        s = s .. charset:sub(r, r)
    end
    c.response.header["x-foo"]= s
end
`)
-> <shunt>;

Results (report outputs not included in this copy):
- route without Lua
- route with Lua
- baseline with Lua before change

Signed-off-by: Sandor Szücs [email protected]

szuecs, Jun 02 '21 18:06

I think we need to better understand how well (or not) the pool actually behaves before making it more complex. E.g. I have added a log statement to see when a new state is allocated (which presumably introduces latency):

diff --git a/script/script.go b/script/script.go
index 4b14682e..68a6d83c 100644
--- a/script/script.go
+++ b/script/script.go
@@ -77,6 +77,7 @@ func (s *script) getState() (*lua.LState, error) {
                }
                return L, nil
        default:
+               log.Info("newState")
                return s.newState()
        }

and run bin/skipper -inline-routes='* -> lua("function request(c,p) sleep(1000) end") -> status(204) -> <shunt>' -access-log-disabled, i.e. a Lua script with 1s latency. Then I put some load on it via echo "GET http://localhost:9090/test" | vegeta attack -rate=100/s | vegeta report -every 1s and indeed observe the 1s latency:

Duration      [total, attack, wait]             56.981s, 55.98s, 1.001s
Latencies     [min, mean, 50, 90, 95, 99, max]  1s, 1.001s, 1.001s, 1.002s, 1.002s, 1.003s, 1.01s
Bytes In      [total, mean]                     0, 0.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      204:5599  
Error Set:

What I do not understand though is that with 100 rps and 1s latency there should be 100 requests in parallel, which with the default state pool size of 10 should create new states, but I do not observe that. The pool is implemented as a Go channel; maybe something interesting goes on when many goroutines read/write to it in parallel.
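For illustration, the buffered-channel pool pattern looks roughly like this (a simplified sketch of the pattern, not script.go verbatim):

package luapool

import lua "github.com/yuin/gopher-lua"

// pool holds idle states in a buffered channel: an empty channel means
// allocate a new state, a full channel means close and drop the returned one.
type pool struct{ ch chan *lua.LState }

func newPool(size int) *pool {
	return &pool{ch: make(chan *lua.LState, size)}
}

func (p *pool) getState() *lua.LState {
	select {
	case L := <-p.ch:
		return L // reuse an idle state
	default:
		return lua.NewState() // pool empty: allocate (the logged case above)
	}
}

func (p *pool) putState(L *lua.LState) {
	select {
	case p.ch <- L:
	default:
		L.Close() // pool full: discard
	}
}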

The PR introduces some assumptions in the form of constants and a bit of magic. I also think it would be better to compare the baseline against the changed version to see if it actually improves anything. Also, I think the tests should run for a longer period of time.

AlexanderYastrebov, Jun 10 '21 16:06

That's true; right now this is WIP. We just discussed that we should start with converting the pool/filter-instance setup to a shared global pool on the spec, where we can access the Lua states via a key, e.g. a hash of the code.

aryszka, Jun 10 '21 17:06

@AlexanderYastrebov I agree; this is just trying things out as a WIP, and the pool should be bounded and structured differently. We could even pre-allocate a generic pool and load the code when a script instance is created. One thing to figure out is how to return states to the generic pool: we would need a Close() on the filter that gives the state back after removing the filter's code. I think we have no Close() that we call when we delete a route, for example.
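A sketch of what such a hypothetical Close() hook could do; skipper has no such hook at this point, and the type and field names here are assumptions:

package luapool

import lua "github.com/yuin/gopher-lua"

// script mirrors the per-filter pool shape. A routing layer that knew
// about Close() could call it when the route is deleted.
type script struct{ pool chan *lua.LState }

// Close drains the filter's pool and closes each state; alternatively it
// could reset the states and hand them back to a shared generic pool.
func (s *script) Close() error {
	for {
		select {
		case L := <-s.pool:
			L.Close()
		default:
			return nil
		}
	}
}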

szuecs, Jun 10 '21 17:06

What I do not understand though is that with 100 rps and 1s latency there should be 100 requests in parallel which with the default state pool size of 10 should create new states but I do not observe that.

Now I get it: in steady state with a constant request rate, the actual rate and latency do not matter. The states are drained from the pool at the same rate they are returned, so there is no need to create new states; after warm-up, every finishing request returns a state just in time for an arriving one. The pool size seems to matter when the rate grows or jitters, but I have yet to figure out the exact dependency.
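A toy simulation of this steady-state effect (assumed numbers, not skipper code): at 100 rps, 1s latency, and pool capacity 10, "new states" are only counted during the first second of warm-up, and afterwards takes and returns balance out:

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	var created int64
	pool := make(chan struct{}, 10) // pool capacity 10, like the default

	var wg sync.WaitGroup
	tick := time.NewTicker(10 * time.Millisecond) // 100 rps
	defer tick.Stop()

	for i := 0; i < 300; i++ { // 3s of constant load
		<-tick.C
		wg.Add(1)
		go func() {
			defer wg.Done()
			var s struct{}
			select {
			case s = <-pool: // reuse a returned state
			default:
				atomic.AddInt64(&created, 1) // pool empty: "create" a state
			}
			time.Sleep(time.Second) // simulated 1s filter latency
			select {
			case pool <- s: // return the state
			default: // pool full: drop it
			}
		}()
	}
	wg.Wait()
	fmt.Println("states created:", created) // typically ~100, all in the first second
}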

AlexanderYastrebov, Jun 10 '21 18:06

In my tests on Friday (code changes not pushed) I could observe more state creations at >3500 rps, but my optimization is a slowdown, so the only thing I would do is increase the pool size. Maybe we should pass the pool size per filter via the create function, set through a CLI arg and config. If there is an instance in the pool, it's crazy fast for an interpreted language that we hook in!

szuecs, Jun 12 '21 19:06

maybe we should pass the pool size per filter

We have the reserved parameter names starting with "lua-" documented:

Any parameter starting with "lua-" should not be used to pass values for the script - those will be used for configuring the filter.

so technically we may use it to set the min and max pool size. The max pool size may additionally be capped by a startup config value (if/when needed). The problem with "lua-" filter parameters is that once someone uses them, we would not be able to break them.

Previously I thought a min pool size did not make much sense, but now I think it does: it allows the pool to grow to an optimal size with respect to the request rate pattern on the given route. With this in mind, maybe we do not need per-filter config; a global config to set the max pool size per filter may be enough.
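If we went the "lua-" parameter route, parsing could look roughly like this; "lua-pool-min"/"lua-pool-max" are invented names, only the "lua-" prefix reservation itself is documented:

package luapool

import (
	"strconv"
	"strings"
)

// poolBounds extracts hypothetical pool bounds from filter parameters,
// falling back to assumed defaults and capping the per-filter max with a
// global startup limit, as discussed above.
func poolBounds(params []string, globalMax int) (min, max int) {
	min, max = 0, 10 // assumed defaults; 10 matches the current pool size
	for _, p := range params {
		if v := strings.TrimPrefix(p, "lua-pool-min="); v != p {
			if n, err := strconv.Atoi(v); err == nil {
				min = n
			}
		}
		if v := strings.TrimPrefix(p, "lua-pool-max="); v != p {
			if n, err := strconv.Atoi(v); err == nil {
				max = n
			}
		}
	}
	if globalMax > 0 && max > globalMax {
		max = globalMax // cap per-filter value with startup config
	}
	if min > max {
		min = max
	}
	return min, max
}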

One thing to figure out is how to return to the generic pool.

We have an unsolved ticket for filter cleanup: https://github.com/zalando/skipper/issues/202. We currently do not close states in the pool on route updates. That should not be a big deal unless someone uses iolib to create temp files, which are removed on state closing.

We just discussed that we should start with converting the pool/filter instance setup to a shared global pool on the spec, where we can access the Lua states via a key, eg a hash of the code.

Sounds like the global pool should be some kind of LRU cache or bounded priority queue, but it is not immediately obvious how it would look, as it should support multiple values per key. What I like about the current per-filter pool implementation is how simple it is.
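A rough sketch of the code-keyed shared pool, with multiple idle states per script hash; all names are assumptions, and the bounding/LRU eviction is omitted, which is exactly the open question above:

package luapool

import (
	"crypto/sha256"
	"sync"

	lua "github.com/yuin/gopher-lua"
)

// globalPool keeps idle states per hash of the script source.
type globalPool struct {
	mu   sync.Mutex
	idle map[[sha256.Size]byte][]*lua.LState
}

func newGlobalPool() *globalPool {
	return &globalPool{idle: map[[sha256.Size]byte][]*lua.LState{}}
}

func (g *globalPool) get(code string) *lua.LState {
	k := sha256.Sum256([]byte(code))
	g.mu.Lock()
	defer g.mu.Unlock()
	if s := g.idle[k]; len(s) > 0 {
		L := s[len(s)-1]
		g.idle[k] = s[:len(s)-1]
		return L // this state already has the script loaded
	}
	return lua.NewState() // caller loads the script into the fresh state
}

func (g *globalPool) put(code string, L *lua.LState) {
	k := sha256.Sum256([]byte(code))
	g.mu.Lock()
	defer g.mu.Unlock()
	g.idle[k] = append(g.idle[k], L)
}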

AlexanderYastrebov, Jun 12 '21 21:06

After testing, I am all in with the per-filter pool. Right now it's fixed in size, but maybe we should let it grow. Growing the current pool does not seem possible right now, because the size is the capacity of the buffered channel, and I don't think that is easy to grow. Maybe we should use another data structure for this. Any ideas?
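For illustration, one candidate data structure is sync.Pool, which grows on demand and shrinks under GC pressure, unlike a fixed-capacity buffered channel; whether GC-driven eviction suits lua.LState reuse is an open question, so this is a sketch, not a recommendation:

package luapool

import (
	"sync"

	lua "github.com/yuin/gopher-lua"
)

// statePool grows as needed; idle states may be collected by the GC.
var statePool = sync.Pool{
	New: func() any { return lua.NewState() },
}

// withState borrows a state for the duration of f and returns it.
func withState(f func(L *lua.LState)) {
	L := statePool.Get().(*lua.LState)
	defer statePool.Put(L)
	f(L)
}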

Passing the pool size in filters should only be allowed within reasonable bounds; otherwise we have the same problem as with the in-memory ratelimit filters: clientRatelimit/ratelimit.

szuecs, Jun 12 '21 21:06

How can we block a Lua VM? We need that to test a growing pool size and how the limiting of the number of Lua VMs works, right now vs. with a pre-allocated number of VMs.

szuecs, Aug 24 '22 14:08