protoactor-go icon indicating copy to clipboard operation
protoactor-go copied to clipboard

Memory leak when actor metrics are enabled

Open antoniomacri opened this issue 2 years ago • 4 comments

Describe the bug Enabling actor metrics:

	system := actor.NewActorSystem(actor.WithMetricProviders(global.MeterProvider()))

causes a memory leak when actors are stopped.

To Reproduce Run the following main:

package main

import (
	"fmt"
	"github.com/asynkron/protoactor-go/actor"
	"github.com/google/uuid"
	"go.opentelemetry.io/otel/metric/global"
	"log"
	"math/rand"
	"net/http"
	_ "net/http/pprof"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	system := actor.NewActorSystem(actor.WithMetricProviders(global.MeterProvider()))

	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	for i := 0; i < 20_000; i++ {
		message := &MyData{id: uuid.New().String(), value: randomString(100_000)}
		props := actor.PropsFromProducer(func() actor.Actor {
			return &myActor{data: make(map[string]*MyData)}
		})
		pid, err := system.Root.SpawnNamed(props, message.id)
		if err != nil && err != actor.ErrNameExists {
			fmt.Printf("Error while sending message to actor=%v err=%v\n", pid, err)
		}
		system.Root.Send(pid, message)
		if err := system.Root.PoisonFuture(pid).Wait(); err != nil {
			fmt.Printf("Error while terminating actor=%v err=%v\n", pid, err)
		}
		time.Sleep(5 * time.Millisecond)
	}
	fmt.Println("-- Done.")

	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	<-sigs
}

type myActor struct {
	data map[string]*MyData
}

type MyData struct {
	id    string
	value string
}

func (a *myActor) Receive(ctx actor.Context) {
	switch msg := ctx.Message().(type) {
	case *MyData:
		a.data[msg.id] = msg
	}
}

func randomString(length int) string {
	const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
	b := make([]byte, length)
	for i := range b {
		b[i] = letterBytes[rand.Intn(len(letterBytes))]
	}
	return string(b)
}

A few GB of RAM are retained even though the actors are stopped immediately after sending the data message.

By replacing

	system := actor.NewActorSystem(actor.WithMetricProviders(global.MeterProvider()))

with

	system := actor.NewActorSystem()

the memory leak disappears.

Expected behavior The memory is released when the actors are stopped.

Additional context In props.go the default spawner contains this code:

		ctx := newActorContext(actorSystem, props, parentContext.Self())
		mb := props.produceMailbox()

		// prepare the mailbox number counter
		if ctx.actorSystem.Config.MetricsProvider != nil {
			sysMetrics, ok := ctx.actorSystem.Extensions.Get(extensionId).(*Metrics)
			if ok && sysMetrics.enabled {
				if instruments := sysMetrics.metrics.Get(metrics.InternalActorMetrics); instruments != nil {
					sysMetrics.PrepareMailboxLengthGauge()
					meter := global.Meter(metrics.LibName)
					if err := meter.RegisterCallback([]instrument.Asynchronous{instruments.ActorMailboxLength}, func(goCtx context.Context) {
						instruments.ActorMailboxLength.Observe(goCtx, int64(mb.UserMessageCount()), sysMetrics.CommonLabels(ctx)...)
					}); err != nil {
						err = fmt.Errorf("failed to instrument Actor Mailbox, %w", err)
						plog.Error(err.Error(), log.Error(err))
					}
				}
			}
		}

Specifically the closurefunc(goCtx context.Context) captures mb and ctx, and is not unregistered after the actor is stopped.

This seems to be the root cause of the memory leak.

antoniomacri avatar Oct 18 '22 10:10 antoniomacri

Also, it's not clear to me how that metric (ActorMailboxLength) should work. Shouldn't it sum the mailbox length for all the living actors? It instead seems to replace the gauge value with the mailbox length of the last called back actor.

antoniomacri avatar Oct 18 '22 10:10 antoniomacri

I'm keen to help with this issue.

Any hint on how to fix the leak and correctly implement the mailbox metric?

antoniomacri avatar Jan 05 '23 11:01 antoniomacri

Thanks for the detailed report. I have to dive into this to see what the memory issue could be

rogeralsing avatar Jan 05 '23 14:01 rogeralsing

It should be due to the registered callback for each actor, which keeps references to its mailbox and context via the closure.

So changing the callback logic should also fix the leak.

I'm thinking about registering a single callback in the OpenTelemetry Meter. Go implementation of the Meter doesn't provide a way to unregister the callback (although this seems mandated by the spec...). Therefore this should be handled explicitly. WDYT?

Thanks for the response.

antoniomacri avatar Jan 05 '23 14:01 antoniomacri