A corner case causes a panic.
This demo will panic:
package main

import (
    "github.com/buraksezer/consistent"
    "github.com/cespare/xxhash"
)

type hasher struct{}

type myMember string

func (m myMember) String() string {
    return string(m)
}

func (h hasher) Sum64(data []byte) uint64 {
    // You should use a proper hash function for uniformity.
    return xxhash.Sum64(data)
}

func main() {
    cfg := consistent.Config{
        PartitionCount:    1,
        ReplicationFactor: 1,
        Load:              1,
        Hasher:            hasher{},
    }
    c := consistent.New(nil, cfg)

    node1 := myMember("node1")
    c.Add(node1)
}
Here is the error message:
panic: not enough room to distribute partitions
goroutine 1 [running]:
github.com/buraksezer/consistent.(*Consistent).distributeWithLoad(0x450060, 0x0, 0x0, 0x43e2e0, 0x43e2c0, 0x34c96acd)
/tmp/gopath552882815/pkg/mod/github.com/buraksezer/[email protected]/consistent.go:165 +0x3a0
github.com/buraksezer/consistent.(*Consistent).distributePartitions(0x450060, 0x1604d0)
/tmp/gopath552882815/pkg/mod/github.com/buraksezer/[email protected]/consistent.go:196 +0xc0
github.com/buraksezer/consistent.(*Consistent).Add(0x450060, 0x1604d0, 0x40c150, 0x2b5f)
/tmp/gopath552882815/pkg/mod/github.com/buraksezer/[email protected]/consistent.go:227 +0x180
main.main()
/tmp/sandbox195891616/prog.go:30 +0xe0
So, is this expected behavior or a bug?
It seems this condition is wrong (https://github.com/buraksezer/consistent/blob/master/consistent.go#L163):
if count >= len(c.sortedSet) {
Is it reasonable to drop the equality from the comparison, like this?
if count > len(c.sortedSet) {
Hello,
Your configuration is invalid, so this is the expected behavior. You should use something like this:
// Create a new consistent instance
cfg := consistent.Config{
    PartitionCount:    7,
    ReplicationFactor: 20,
    Load:              1.25,
    Hasher:            hasher{},
}
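For reference, a minimal end-to-end example built around this configuration could look like the following. It reuses the hasher and myMember types from your demo; the member names and the lookup key are just placeholders.

package main

import (
    "fmt"

    "github.com/buraksezer/consistent"
    "github.com/cespare/xxhash"
)

type hasher struct{}

func (h hasher) Sum64(data []byte) uint64 {
    return xxhash.Sum64(data)
}

type myMember string

func (m myMember) String() string {
    return string(m)
}

func main() {
    cfg := consistent.Config{
        PartitionCount:    7,
        ReplicationFactor: 20,
        Load:              1.25,
        Hasher:            hasher{},
    }
    c := consistent.New(nil, cfg)

    // Adding more than one member gives the partitions somewhere to spread.
    c.Add(myMember("node1"))
    c.Add(myMember("node2"))
    c.Add(myMember("node3"))

    // LocateKey maps a key to one of the members.
    owner := c.LocateKey([]byte("my-key"))
    fmt.Println(owner.String())
}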
The Configuration section in the README file explains these variables: https://github.com/buraksezer/consistent#configuration
If you explain your use case, I may be able to help further with the configuration.
Thanks!
I have checked the definitions of the variables in consistent.Config, but I still can't understand why this case is wrong:
cfg := consistent.Config{
    PartitionCount:    1,
    ReplicationFactor: 1,
    Load:              1,
    Hasher:            hasher{},
}
c := consistent.New(nil, cfg)
node1 := myMember("node1")
c.Add(node1)
One node with one replica for one partition seems reasonable.
@buraksezer is it possible for you to fix what @ShiKaiWi is reporting? I'm running into a similar issue as well. This patch seems reasonable, and that's what I did to keep going.
We may want to run (unit/functional) tests where we try to squeeze a small number of partitions onto a single node, and the logic here does seem incorrect. The count that checks whether we have exhausted all nodes should be incremented only after we check whether the partition can be accommodated, not before. Either the count should be incremented at the end of the loop, or the patch that @ShiKaiWi provides should be applied.
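To make the off-by-one concrete, here is a simplified stand-alone sketch of the counting pattern (just an illustration of the logic, not the library's actual code):

package main

import "fmt"

// pickMember is a simplified sketch of the counting pattern under discussion;
// it is NOT the library's code. candidates plays the role of c.sortedSet and
// capacity plays the role of the average load bound.
func pickMember(candidates []string, loads map[string]float64, capacity float64) (string, bool) {
    count := 0
    idx := 0
    for {
        count++
        // Because count is incremented before the check and compared with >=,
        // the loop gives up after examining only len(candidates)-1 entries;
        // with a single candidate it gives up immediately.
        if count >= len(candidates) {
            return "", false // "not enough room to distribute partitions"
        }
        m := candidates[idx]
        if loads[m]+1 <= capacity {
            loads[m]++
            return m, true
        }
        idx = (idx + 1) % len(candidates)
    }
}

func main() {
    loads := map[string]float64{}
    // One candidate with spare capacity, yet the pre-incremented count makes
    // the search fail before the candidate is ever examined.
    _, ok := pickMember([]string{"node1"}, loads, 1)
    fmt.Println(ok) // false
}

Incrementing the count at the end of the loop, or changing >= to >, lets every candidate be examined before giving up.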
If you wish, I can create a PR as well. Please let us know.
@sriramch firstly, I'm so sorry for the late response. The configuration reported by @ShiKaiWi was invalid. Why would you use a consistent hash function if you have only one partition? The ReplicationFactor and Load parameters should be bigger than 1. Furthermore, these parameters should be chosen wisely.
If you provide a sample configuration, I may understand what the actual problem is. I'm OK with the changes if they fix a real problem.
@buraksezer thanks for responding!
As I mentioned earlier, we may want to run unit/functional tests with a smaller number of partitions on one or a few nodes with just 1 replica to simulate some failure scenarios. For instance, the following would panic.
Here the PartitionCount and the Load are indeed > 1.
numNodes := 2 // 1 or 2
numReplicas := 1

members := []consistent.Member{}
for i := 0; i < numNodes; i++ {
    member := Member(fmt.Sprintf("node%d.olricmq", i))
    members = append(members, member)
}

cfg := consistent.Config{
    PartitionCount:    10,          // partition range for modulo
    ReplicationFactor: numReplicas, // how many times each member is replicated
    Load:              1.1,
    Hasher:            murmur3_hasher{},
}
c := consistent.New(members, cfg)
My earlier comment explains the cause and a potential fix. Why can't this be accommodated, and why is it invalid to have a non-trivial number of partitions on one or a couple of nodes with one replica? Why is this incorrect conceptually?
Shouldn't the count (that tracks whether we have exhausted all nodes) be incremented only after checking whether a partition can be accommodated on the node, as opposed to before, as it is currently done?
@ShiKaiWi @sriramch I think @buraksezer is right.
1. If you want a bounded load, you cannot introduce edge cases that break the rule math.Ceil(c * m / n). On the other hand, I recommend we pass an error instead of panicking. We could also check in the constructor whether the parameters are valid.
2. On the other hand, I'm also unsure why the count is incremented before the check. This means only n-1 nodes are examined before exhaustion is declared.
3. But does fixing (2) also fix the panic in (1)? There are plenty of other parameter configurations where the panic remains regardless of whether we fix the counter, because the underlying parameters just don't make sense for load balancing.
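As a rough sketch of what that constructor-time validation could look like (a hypothetical helper, not part of the library's API):

package main

import (
    "errors"

    "github.com/buraksezer/consistent"
)

// validateConfig is a hypothetical helper, not part of the library. It
// sketches the kind of up-front check that could return an error instead of
// letting partition distribution panic later.
func validateConfig(cfg consistent.Config) error {
    if cfg.Hasher == nil {
        return errors.New("config: Hasher cannot be nil")
    }
    if cfg.PartitionCount < 1 {
        return errors.New("config: PartitionCount must be at least 1")
    }
    if cfg.ReplicationFactor < 1 {
        return errors.New("config: ReplicationFactor must be at least 1")
    }
    if cfg.Load <= 1.0 {
        return errors.New("config: Load must be greater than 1")
    }
    return nil
}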
Meanwhile, for @buraksezer, I have a mathematical question on the parameter settings. Is it possible to derive mathematically whether given values for the ReplicationFactor, PartitionCount, and Load parameters are valid or invalid?
In addition to that, what would be a good parameter set for a production environment? (Good as in just valid; performance-wise I can definitely benchmark.) Let's say on average I will have 5.6 members, with a variance of 1.2 members².
@iqDF - thanks for your comments.
If you want a bounded load, you cannot introduce edge cases that break the rule math.Ceil(c * m / n).
That is an implementation detail. As I asked earlier, conceptually, why is distributing a non-trivial number of partitions over one or a couple of nodes with one replica incorrect (with a load factor > 1)?
But does fixing (2) also fix the panic in (1)?
It should, and I have mentioned it here as well (see the last line of the 2nd paragraph).
There are plenty of other parameter configurations where the panic remains regardless of whether we fix the counter, because the underlying parameters just don't make sense for load balancing.
Can you please provide some examples? I was under the impression that we should be able to accommodate p partitions on n nodes with r replicas (for values of p/n/r that conceptually make sense) by adjusting the load factor.
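For what it's worth, here is the arithmetic I have in mind, assuming the ceil(c * m / n) capacity rule quoted above (m = partitions, n = members, c = load factor); the two cases are the configurations discussed in this thread:

package main

import (
    "fmt"
    "math"
)

// capacity applies the bounded-load rule quoted above: each member may own
// at most ceil(load * partitions / members) partitions.
func capacity(load float64, partitions, members int) float64 {
    return math.Ceil(load * float64(partitions) / float64(members))
}

func main() {
    // The original report: 1 partition, 1 member, load 1.
    // Per-member capacity is ceil(1*1/1) = 1, so the single partition
    // should fit; the panic comes from the counter, not the math.
    fmt.Println(capacity(1.0, 1, 1)) // 1

    // The second example: 10 partitions, 2 members, load 1.1.
    // Per-member capacity is ceil(1.1*10/2) = 6, total capacity 12 >= 10.
    fmt.Println(capacity(1.1, 10, 2)) // 6
}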