otter
Can someone please run a benchmark on core utilization, e.g. a comparison with other caches on 48, 196, and 384 CPU cores?
Curious about lock contention and performance on multicore systems; the original S3-FIFO paper only measured up to 16 cores.
I once tested on a virtual machine with 32 cores, but it didn't give very interesting results.
@kolinfluence Do you mind me asking about the use case? We are happy to further optimize for scalability if needed.
@1a1a11a Mostly as an LRU. @maypok86 Would it be possible to add an mmap version for inter-process use in Go? (It could also be used by other languages.)
Oh, it's not clear at all why this is needed in an onheap cache. Moreover, it would most likely require writing OS-specific code. Perhaps in the future I will add the ability to pass a custom allocator, but I'm not sure this would be useful to many users.
// Allocator would let the user decide how cache memory is obtained
// and released, e.g. via cgo, arenas, or mmap.
type Allocator interface {
	Alloc(size uint32) ([]byte, error)
	Free(b []byte) error
}
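For illustration only, a hypothetical mmap-backed implementation of that interface (not part of otter, Linux-oriented, using golang.org/x/sys/unix) could look like this:

package mmapalloc

import "golang.org/x/sys/unix"

// MmapAllocator takes memory from anonymous mmap'd pages instead of
// the Go heap, so the GC never scans the returned buffers.
type MmapAllocator struct{}

func (MmapAllocator) Alloc(size uint32) ([]byte, error) {
	return unix.Mmap(-1, 0, int(size),
		unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_ANON|unix.MAP_PRIVATE)
}

// Free must receive exactly the slice returned by Alloc, because
// munmap operates on whole mappings.
func (MmapAllocator) Free(b []byte) error {
	return unix.Munmap(b)
}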
@maypok86 Just Linux (Ubuntu 24.04) will do :)
Is it possible to do an mmap version? https://github.com/phuslu/lru/issues/15
- This usually has nothing to do with an onheap cache.
- Architecturally, it is much more correct to use an abstraction and let users implement allocation the way they need, not least because of the existence of cgo and arenas.
- It is unclear why we should reinvent freecache, bigcache, fastcache, etc. Of course, they have their own problems, but they are still good solutions.
- Why do you need an offheap cache at all? Have you thought about consistency and replication? Why not use offheap from another language? Why not a dedicated cache server?
I have similar experience with this -- in some server applications (e.g. bidding), throughput and stable P99 latency are the top priority. The memory usage and hit ratio of the in-memory cache are secondary considerations, and a dedicated cache server is the last option.
That's part of the reason I used fastcache before.
@maypok86
- mmap would solve cache sharing between processes in Go and other languages.
- No comment, but we are trying to get this performance out of Go.
- They don't support mmap that can be used across different, independent processes.
- A dedicated cache server (e.g. Redis-like) goes through TCP (50k req/s) / UDS (90k req/s) or even shmipc (800k req/s), BUT mmap does 8M req/s, comparatively speaking, for a 64-byte transfer. The larger the per-key-value transfer, the greater the disparity.
@phuslu I used fastcache too, but now phuslu/lru. Anyway, I need an mmap option for inter-process cache sharing.
phuslu/lru is more evenly balanced between reads and writes, while otter is unbeatable in read speed but its write speed is about 1/3 of phuslu's.
E.g., 600k req/s for both reads and writes with phuslu, vs. 1M read req/s and 280k write req/s with otter.
otter can be good for a CDN, but phuslu is more general purpose.
@phuslu
In some server applications (e.g. bidding), throughput and stable P99 latency are the top priority.
Yes, this is quite common.
The memory usage and hit ratio of the in-memory cache are secondary considerations.
I probably agree about the memory usage. But the hit ratio has a very strong effect on throughput and P99 latency, since a cache miss usually means a much longer wait. Moreover, with high probability, P99 latency will be completely determined by how fast cache misses are served by the other resource: for example, with a 1% miss rate, misses alone account for the slowest 1% of requests, so P99 is effectively the miss latency. The hit ratio affects throughput simply because, as the number of cache misses grows, the external resource has to process many more requests, which it is most likely not prepared for.
@kolinfluence
- It is still unclear why you want to make an offheap cache out of an onheap cache. These caches solve different problems. Moreover, Go already has offheap caches, and fastcache already uses mmap.
- Okay, so why not go to them and ask for this feature?
A dedicated cache server (e.g. Redis-like) goes through TCP (50k req/s) / UDS (90k req/s)
Very questionable numbers. Did you test a single Redis instance, which utilizes only one core? A Redis cluster on a multicore machine can easily process millions of requests per second, and it gives you a large number of features that you would struggle to implement for your offheap cache. The simplest examples are consistency, replication, and availability, which I have already written about...
E.g., 600k req/s for both reads and writes with phuslu, vs. 1M read req/s and 280k write req/s with otter.
I don't know how you tested, but those results are far too low for an onheap cache.
phuslu/lru is more evenly balanced between reads and writes
Obviously, phuslu/lru will have approximately the same speed for reads and writes, because it is just a regular LRU split into a large number of shards, each protected by a mutex.
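In a minimal sketch (hypothetical code, not phuslu/lru's actual internals), the pattern looks like this; note that even a read takes exactly one shard's mutex, which is why reads and writes cost about the same:

package shardedlru

import (
	"container/list"
	"hash/fnv"
	"sync"
)

// entry is the value stored in each LRU node.
type entry struct {
	key, value string
}

// shard is a classic mutex-protected LRU.
type shard struct {
	mu   sync.Mutex
	size int
	ll   *list.List               // front = most recently used
	m    map[string]*list.Element // key -> node in ll
}

func (s *shard) get(key string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if el, ok := s.m[key]; ok {
		s.ll.MoveToFront(el) // a read mutates recency order, hence the lock
		return el.Value.(*entry).value, true
	}
	return "", false
}

func (s *shard) set(key, value string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if el, ok := s.m[key]; ok {
		el.Value.(*entry).value = value
		s.ll.MoveToFront(el)
		return
	}
	s.m[key] = s.ll.PushFront(&entry{key, value})
	if s.ll.Len() > s.size {
		oldest := s.ll.Back() // evict the least recently used entry
		s.ll.Remove(oldest)
		delete(s.m, oldest.Value.(*entry).key)
	}
}

// shardedLRU routes each key to one shard by hash, so unrelated keys
// rarely contend on the same mutex.
type shardedLRU struct{ shards []*shard }

func (c *shardedLRU) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return c.shards[h.Sum32()%uint32(len(c.shards))]
}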
otter is unbeatable in read speed but its write speed is about 1/3 of phuslu's
Yes, otter is focused on read-heavy loads. Its behavior on different load profiles can be seen in the README; I tried to show both the advantages and the disadvantages.
It looks like you tested the behavior under a reads=0%, writes=100% load. Yes, otter doesn't do well there, because all of its optimizations stop working and just eat up processor time. But already under a reads=25%, writes=75% load, otter does great. So one question has to be asked here: does your cache really have a hit ratio of 0%? Do you need a cache at all in that situation? It seems that even very good caches will not help you there.
In principle, caches usually handle read-heavy loads, and write-heavy loads are very rare.
And now the most fun part :). The fact is that otter behaves more like a single LRU cache with a mutex than like one split into many shards. So you could simply split otter itself into shards and increase throughput on a write-only load, as sketched below. But I don't think anyone needs that, since even a single otter instance can process millions of write requests per second.
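A rough sketch of that idea (a hypothetical wrapper; it assumes otter's builder/Get/Set API as shown in its README):

package shardedotter

import (
	"hash/fnv"

	"github.com/maypok86/otter"
)

// shardedOtter routes each key to one of n independent otter
// instances by hash, trading a little memory and hit ratio for
// write throughput.
type shardedOtter struct {
	shards []otter.Cache[string, string]
}

func newShardedOtter(n, capacityPerShard int) (*shardedOtter, error) {
	s := &shardedOtter{shards: make([]otter.Cache[string, string], n)}
	for i := range s.shards {
		c, err := otter.MustBuilder[string, string](capacityPerShard).Build()
		if err != nil {
			return nil, err
		}
		s.shards[i] = c
	}
	return s, nil
}

func (s *shardedOtter) shardFor(key string) otter.Cache[string, string] {
	h := fnv.New32a()
	h.Write([]byte(key))
	return s.shards[h.Sum32()%uint32(len(s.shards))]
}

func (s *shardedOtter) Get(key string) (string, bool) { return s.shardFor(key).Get(key) }
func (s *shardedOtter) Set(key, value string) bool    { return s.shardFor(key).Set(key, value) }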
otter can be good for a CDN, but phuslu is more general purpose.
A general-purpose cache is a very complicated thing; there is no such cache in Go at the moment. The choice of cache is the user's decision, so I would not like to interfere with it, only to provide information useful for making that decision.
The hit ratio has a very strong effect on throughput and P99 latency
Yes, that is undoubtedly correct in most cases. But in some scenarios, especially when I use fastcache, we always try to cache as much content as possible, or even all of it, in a single process's memory.
In this "slightly rare" scenario, memory usage is slightly more important, and the importance of hit ratio is somewhat reduced.
Of course, phuslu/lru is not simply a drop-in replacement for fastcache (because it is an onheap cache), but I am trying to use it in similar scenarios now.
@maypok86
It is still unclear why you want to make an offheap cache out of an onheap cache. These caches solve different problems. Moreover, Go already has offheap caches, and fastcache already uses mmap.
How do you do high-performance concurrent reads and writes from multiple different and separate Go programs to the same cache (mmap or otherwise, either LRU or S3-FIFO), without going through the TCP/UDP/UDS stack, at more than 2 million ops/second across those processes concurrently hitting the same "key" field?
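(For context: the building block under discussion is a file-backed shared mapping, which by itself is straightforward; a sketch using golang.org/x/sys/unix follows. The hard part this thread keeps circling is the fast cross-process locking and cache layout on top of it.)

package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// Every process that opens and maps the same file sees the same
	// bytes; on Linux, /dev/shm keeps the backing file in memory.
	f, err := os.OpenFile("/dev/shm/cache.bin", os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	const size = 1 << 20 // example region size: 1 MiB
	if err := f.Truncate(size); err != nil {
		log.Fatal(err)
	}
	data, err := unix.Mmap(int(f.Fd()), 0, size,
		unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
	if err != nil {
		log.Fatal(err)
	}
	defer unix.Munmap(data)
	data[0] = 1 // visible to every other process mapping the file
}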
we always try to cache as much content as possible, or even all of it, in a single process's memory.
But then you will have a hit ratio of 100% in this scenario. Yes, up to some point you won't care which eviction policy is used, because you are using the cache as a regular hash table, but as soon as memory runs out you will immediately face a number of problems. Since data rarely fits in the memory of a single process, I see no point in considering this case at all. And if you know you will never exceed the required cache size, why use an LRU at all?
@maypok86 You should consider having a no-TTL version like phuslu/lru too.
You should consider having a no-TTL version like phuslu/lru too.
Just don't use the TTL option and you'll get what you want.
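For example (assuming the builder API as shown in the README):

// no WithTTL call: entries are never expired by time, only evicted
// by the policy when capacity is exceeded
cache, err := otter.MustBuilder[string, string](10_000).Build()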
multiple different and separate Go programs reading and writing to the same cache
You have just described a feature that could easily be added to any offheap cache. Why do it in an onheap cache is still unclear.
However, mmap is only used by fastcache, and it is barely maintained anymore.
So it's better to try the offheap caches, or maybe @phuslu will implement it.
But then you will have a hit ratio of 100% in this scenario.
We hope/pray for this, but sometimes it won't be the case (it depends on external inputs), and we somehow bear it.
you are using the cache as a regular hash table
It can indeed be described that way -- a fixed-size hash table.
why use an LRU at all?
It is easier to implement the LRU algorithm in an onheap cache, and then it can be used as a better "selling point" than fastcache.
@maypok86
You have just described a feature that could easily be added to any offheap cache. Why do it in an onheap cache is still unclear.
It's actually extremely difficult; no one has done it yet. There seems to be an issue with building a high-performance global lock that works across separate, independent programs/processes.
Why don't you try it on yours and see what I mean? phuslu needs until May 1st to complete it.
Damn, do you really need this? You don't just want a cache library, but a solution optimized for your needs that no one else needs.
Why don't you try it on yours and see what I mean?
"Easily" may have been too strong a word, but I'll say it again: I won't add this, because it goes far beyond the onheap cache domain. Maybe you should go to the offheap cache repos; your feature is much more related to their domain.
phuslu needs until May 1st to complete it.
Okay, good luck to him with that. Perhaps he will be able to do something better than the existing offheap solutions.
a solution optimized for your needs that no one else needs.
I also have some need for this, because Go has a "performance gap" on machines with a large number of cores. So I use 8/16-core VMs with an in-memory cache, while sharding user requests at the L7 load balancer.
Perhaps he will be able to do something better than the existing offheap solutions.
Currently I'm not confident it can do better than fastcache, but I will give it a try.
It looks like you tested the behavior under a reads=0%, writes=100% load. Yes, otter doesn't do well there, because all of its optimizations stop working and just eat up processor time. ... Do you need a cache at all in that situation? It seems that even very good caches will not help you there.
OK, I have to say this: I need a cache that is fixed length and automatically evicts items when it's full (an LRU feature; S3-FIFO is fine as well). Since your caches can outperform a Go map, I might as well just use the implementations you have all built for different purposes: write-heavy, read-heavy, balanced, etc.
I'm using the cache to update a timestamp on every read, meaning it must write a timestamp each time an item is read. This makes otter run at a 50-50 read/write split for each read (with its accompanying write), plus another process doing pure writes at maybe 3%, for a total of about 48.54% reads and 51.46% writes. (Honestly, this part uses phuslu/lru, because something in otter is "blocking" the writes and not maximizing CPU usage. This "blocking" in high-performance caching gives the impression that my goroutines have to do too much "switching", which makes me think phuslu/lru is more efficient. I haven't benchmarked actual use, only looked at CPU usage and how it is used; for "my program", I think my assumption is correct.)
For reference, the benchmark from freelru:
Adding objects, FreeLRU is ~3.5x faster than SimpleLRU, no surprise. But it is also significantly faster than Go maps, which is a bit of a surprise.
This is with 0% memory overcommitment (default) and a capacity of 8192.
BenchmarkFreeLRUAdd_int_int-20            43097347    27.41 ns/op    0 B/op    0 allocs/op
BenchmarkFreeLRUAdd_int_int128-20         42129165    28.38 ns/op    0 B/op    0 allocs/op
BenchmarkFreeLRUAdd_uint32_uint64-20      98322132    11.74 ns/op    0 B/op    0 allocs/op ()
BenchmarkFreeLRUAdd_string_uint64-20      39122446    31.12 ns/op    0 B/op    0 allocs/op
BenchmarkFreeLRUAdd_int_string-20         81920673    14.00 ns/op    0 B/op    0 allocs/op ()
BenchmarkMapAdd_int_int-20                35306983    46.29 ns/op    0 B/op    0 allocs/op
BenchmarkMapAdd_int_int128-20             30986126    45.16 ns/op    0 B/op    0 allocs/op
BenchmarkMapAdd_string_uint64-20          28406497    49.35 ns/op    0 B/op    0 allocs/op
@maypok86 The high-performance mmap / shared-memory aspect is needed for an extreme use case; someday you may see more people looking for this. phuslu seems to understand why it's needed for his "bidding"-type system. I'm doing this for traffic analysis tracking. Yes, the parts needing ultimate performance are already written in C, but Go is used for the "real-time" analysis/monitoring side.
Go has a "performance gap" on machines with a large number of cores.
Uh, what? I've never seen anything like it. Are you talking about this issue? If so, have you tried gnet?
Yes, I've tried it, even this: https://betterprogramming.pub/gain-the-new-fastest-go-tcp-framework-40ec111d40e6?gi=43a44976d895
But I've worked around the issue.
updating a timestamp on every read.
Why can't you just atomically update the time when reading?
But it is also significantly faster than Go maps, which is a bit of a surprise.
What? freelru uses a standard map internally, so how can it be faster? It looks very doubtful. Very much so.
Why can't you just atomically update the time when reading?
The time is tracked for each item read, for IP traffic analysis; it's tied to each IP that is read, not a global atomic counter.
it's written there: https://github.com/elastic/go-freelru
I'm having an issue with a global mutex lock for concurrent inter-process mmap / shared-memory writes in Go. Do you know how to get one done? The only working option so far seems to be flock, but it's really slow.
it's tied to each IP that is read, not a global atomic counter.
Well, anyway, it's still not clear why you can't just do this:
v, ok := cache.Get(key)
if !ok {
return
}
v.Time.Store(time.Now().Unix())
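(This assumes the cached value is a pointer to a struct with an atomic field from sync/atomic, e.g. something like the hypothetical type below; readers bump the timestamp in place, so no cache Set is needed on the read path.)

// hypothetical value type for the cache
type Item struct {
	Time atomic.Int64 // last-access unix time, updated on every read
	Data []byte       // whatever is cached for the IP
}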
Do you know how to get one done?
Unfortunately, it looks like a huge pile of hacks. I'm not sure you'll be able to get more rps than with tcp/udp without a lot of effort.
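If you do try, the usual primitive is an atomic word inside the shared mapping: CAS works across processes because they all see the same physical memory. A rough sketch (hypothetical code, with the standard caveats that it spins, is unfair, and does not recover if a lock holder dies):

package shmlock

import (
	"runtime"
	"sync/atomic"
	"unsafe"
)

// Lock spins on a word that lives inside the mmap'd region, so every
// process mapping the same file contends on the same lock.
func Lock(word *uint32) {
	for !atomic.CompareAndSwapUint32(word, 0, 1) {
		runtime.Gosched() // yield instead of burning the core
	}
}

func Unlock(word *uint32) {
	atomic.StoreUint32(word, 0)
}

// WordAt interprets the first 4 bytes of the shared mapping as the
// lock word; mmap'd regions are page-aligned, so the cast is safe.
func WordAt(data []byte) *uint32 {
	return (*uint32)(unsafe.Pointer(&data[0]))
}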
@maypok86 I haven't benchmarked the timestamp store yet. If it doesn't cost as much as a "Set", it can work for some use cases.
Please don't mention TCP/UDS again; I've enhanced my own Redis-style server using this package, and even that is too slow for me: https://github.com/IceFireDB/redhub
I need mmap / shared memory for an inter-process LRU cache.
For TCP rps etc., you can use this as a reference:
https://stackoverflow.com/questions/1235958/ipc-performance-named-pipe-vs-socket