microcluster icon indicating copy to clipboard operation
microcluster copied to clipboard

`SelectRandom` panics on single-node deployment.

Open mkalcok opened this issue 5 months ago • 4 comments

The function that should return client for one of the cluster members panics if the deployment has only one member.

It seems that on single-node deployments, the size of the "cluster" (c) is 0, causing it to call rand.Intn function with an invalid argument 0.

Stacktrace:

2025/07/14 15:40:51 http: panic serving @: invalid argument to Intn
goroutine 155 [running]:
net/http.(*conn).serve.func1()
/snap/go/10828/src/net/http/server.go:1903 +0xbe      
panic({0x9d55c0?, 0xbafdb0?})
/snap/go/10828/src/runtime/panic.go:770 +0x132
math/rand.(*Rand).Intn(0xab45f0?, 0xc000444660?)
/snap/go/10828/src/math/rand/rand.go:180 +0x4c        
math/rand.Intn(0x0)
/snap/go/10828/src/math/rand/rand.go:453 +0x25        
github.com/canonical/microcluster/v2/client.Cluster.SelectRandom({0xff8980, 0x0, 0x140?})
/root/go/pkg/mod/github.com/canonical/microcluster/[email protected]/client/cluster.go:14 +0x5b

I don't have live code with this bug anywhere, but this snippet roughly shows the reproducer.

func ovnCentralIpsUpdated(ctx context.Context, s state.State, key string, _ string) error {
	errMsgPrefix := fmt.Sprintf("handling of '%s' config failed.", key)
	cluster, err := s.Cluster(false)
	if err != nil {
		logger.Errorf("failed to trigger OVN environment refresh. %v", err)
		errMsg := fmt.Sprintf(
			"%s Failed to trigger environment update in the OVN cluster. "+
				"Config value was not successfully applied. Please see logs for more details.",
			errMsgPrefix,
		)
		return fmt.Errorf("%s", errMsg)
	}

	client := cluster.SelectRandom()
        ...

mkalcok avatar Jul 14 '25 16:07 mkalcok

Hi, thanks for raising awareness on this one!

There are two more SelectRandom funcs which seem to use the same type of logic. But what surprises me is that none of those three functions is actually used within Microcluster itself.

The one from the internal/trust package we can safely remove in v2, but for the other two we should provide an alternative implementation which doesn't panic. For v3 I would argue we can delete all three of them, or do you want to consume such type of logic from Microcluster? Because in this case we might keep it.

@mkalcok have you recently started using this func? Because in all of our single node MicroCloud tests I have never encountered this issue.

roosterfish avatar Jul 15 '25 07:07 roosterfish

For v3 I would argue we can delete all three of them, or do you want to consume such type of logic from Microcluster? Because in this case we might keep it.

I ended up not using it, for obvious reasons 😆. I was in a situation where, during the handling of a client request on the server side, I needed to send out another request (didn't matter to which cluster node). This seemed like a convenient way to get client. I ended up using State.Leader() instead.

@mkalcok have you recently started using this func? Because in all of our single node MicroCloud tests I have never encountered this issue.

Yes, I was just trying it out as a part of implementing a new feature in MicroOVN. I never used it before.

mkalcok avatar Jul 15 '25 07:07 mkalcok

I see thanks. It feels like a good addition then to have a utility like this. Also as it's already in v2 we add the fix to both v3 and v2 so you could potentially start using it afterwards.

The issue is now marked as bug as we clearly need to address the panic. In addition the SelectRandom on the internal/trust package should be removed from both v2 and v3 and the third SelectRandom in rest/types should receive the same fix but can be removed only in v3 to not break v2.

roosterfish avatar Jul 15 '25 07:07 roosterfish

I'll pick this one up next pulse :)

markylaing avatar Jul 17 '25 09:07 markylaing