
OPA out of memory

Open itayhac opened this issue 1 year ago • 13 comments

We are working with OPA as our policy agent. We deploy multiple instances of OPA as Docker containers on Kubernetes, each with a k8s memory limit of 4GB, and each loading a bundle whose data.json file is about ~15MB.

Recently we have noticed that some of our OPA instances are being restarted due to OOM. After further investigation we found that this happens when OPA receives frequent requests and memory is not freed fast enough, which in turn results in an OOM very quickly (within 3 seconds).

Disclaimer: the bundle I share here contains mock data that closely mimics our use case. I will share the heap dumps we captured for both the mock data and the actual production data (both with the same Rego code).

Please note that these functions are taking almost 90 percent of the memory, and the service gets OOMKilled within seconds (see the attached memory profile screenshot).

This is also true for our production memory profile.

  • OPA version: latest
  • The bundle is provided.
  • Memory profiles (captured with pprof) for both the mock data bundle and the production data run.
  • Go code that sends 100 requests to the local OPA.

Steps To Reproduce

Run the following command to start OPA:

opa run --bundle itay_kenv_files/test_15mb.tar.gz --server --pprof --log-level=info

Then run the Go code below ("Code that sends 100 requests to OPA") to trigger the OPA requests.
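For reference, a heap profile like the ones attached can be captured while the load test is running. The sketch below assumes OPA is listening on the default address localhost:8181 and was started with --pprof, which exposes the standard Go pprof endpoints under /debug/pprof:

package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Fetch a heap profile from OPA's pprof endpoint (enabled by --pprof).
	resp, err := http.Get("http://localhost:8181/debug/pprof/heap")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Save the profile so it can be inspected with: go tool pprof heap.pprof
	f, err := os.Create("heap.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := io.Copy(f, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote heap.pprof")
}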

Expected behavior

Memory usage should remain low, or at least be freed shortly after the requests are made.

Code that sends 100 requests to OPA

package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
	"sync"
	"time"
)

const (
	iterationsNumber = 100
)

func main() {
	log.Println("Starting OPA testing")

	var wg sync.WaitGroup
	wg.Add(iterationsNumber)
	for i := 0; i < iterationsNumber; i++ {
		time.Sleep(40 * time.Millisecond)
		go sendRequest(&wg, i)
	}
	wg.Wait()
	fmt.Println("All go routines have finished.")

}

func sendRequest(wg *sync.WaitGroup, i int) {
	// Mark this goroutine as done so main's wg.Wait() can return.
	defer wg.Done()

	log.Println("Sending request to opa. iteration number:", i)

	// URL to which the POST request will be sent
	url := "http://localhost:8181/v1/data/test_policy/evaluator/access"

	jsonStr := []byte(`{
		"input": {}
	}`)

	// Create a new HTTP request with POST method, specifying the URL and the request body
	req, err := http.NewRequest("POST", url, bytes.NewBuffer(jsonStr))
	if err != nil {
		log.Println("Error creating request:", err)
		return
	}

	// Set the Content-Type header to application/json since we're sending JSON data
	req.Header.Set("Content-Type", "application/json")

	// Create a new HTTP client
	client := &http.Client{}

	// Send the request via the HTTP client
	resp, err := client.Do(req)
	if err != nil {
		log.Println("Error sending request:", err)
		return
	}

	defer resp.Body.Close()

	// Print the HTTP response status code
	log.Println("Response Status:", resp.Status)
}

test_15mb.tar.gz memory profile.zip

If further information regarding our production setup is required, I'll be happy to provide it.

itayhac avatar May 20 '24 13:05 itayhac

Thanks for the detailed issue @itayhac. I tried to reproduce this by running OPA on docker and setting a 4GB memory limit. I increased the number of go routines from your script to send more concurrent requests to OPA. The maximum amount of memory consumed by OPA did not cross 200 MB. Is there something different in your actual setup vs the mock bundle you've provided here? I would expect the CPU usage to spike while OPA handles these requests but it's still unclear why OPA runs OOM.

ashutosh-narkar avatar May 20 '24 23:05 ashutosh-narkar

Hi @ashutosh-narkar, thank you so much for your fast and detailed reply. I changed the files so the issue reproduces with a 4GB memory limit (I increased the size and changed the structure of the data.json file).

Please retry; it should now reproduce.

itayhac avatar May 21 '24 06:05 itayhac

One thing I noticed in the policy is that you're using the object.get builtin on the data set instead of just accessing it under data.rules, for example. You can probably avoid using the builtin. Another thing I noticed is that when I run the stress test with the openpolicyagent/opa:0.64.1-static image variant, there is no significant increase in memory. Have you seen that as well?
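For illustration only (the actual policy isn't included in this thread, so the package name, rule names, and data layout below are hypothetical), the suggested change amounts to something like the following, sketched with OPA's Go API so the two variants can be evaluated side by side:

package main

import (
	"context"
	"fmt"

	"github.com/open-policy-agent/opa/rego"
	"github.com/open-policy-agent/opa/storage/inmem"
)

func main() {
	ctx := context.Background()

	// Hypothetical stand-in for the bundle's data.json.
	store := inmem.NewFromObject(map[string]interface{}{
		"rules": map[string]interface{}{"allow_all": true},
	})

	policies := map[string]string{
		// Variant that reads the value through the object.get builtin.
		"object.get": `package test_policy.evaluator
access := object.get(data.rules, "allow_all", false)`,
		// Equivalent variant that references data.rules directly, as suggested above.
		"direct": `package test_policy.evaluator
access := data.rules.allow_all`,
	}

	for name, module := range policies {
		r := rego.New(
			rego.Query("data.test_policy.evaluator.access"),
			rego.Module("policy.rego", module),
			rego.Store(store),
		)
		rs, err := r.Eval(ctx)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s variant -> %v\n", name, rs[0].Expressions[0].Value)
	}
}

Both variants return the same result for this toy data; the difference is that the direct reference only reads the leaf value, while the builtin call receives the whole data.rules object as an argument.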

ashutosh-narkar avatar May 22 '24 23:05 ashutosh-narkar

Any further thoughts? @ashutosh-narkar, can we label it as a bug and prioritize it?

itayhac avatar May 26 '24 08:05 itayhac

@itayhac can you please confirm whether you're able to repro this issue with the upstream OPA images, including any differences with the static variant? You mentioned (in a separate thread) that y'all are building your own images. Also, this could be a relevant issue.

ashutosh-narkar avatar May 28 '24 19:05 ashutosh-narkar

The problem reproduces with our own OPA image (we compile latest) and with both of the latest public images (static and non-static).

itayhac avatar May 29 '24 03:05 itayhac

This could be related to https://github.com/open-policy-agent/opa/issues/5946. In your policy you're referring to a large object, and this can be replicated if you modify the policy to refer to the object without using the object.get builtin. @johanfylling did you encounter something like this while working on https://github.com/open-policy-agent/opa/pull/6040 ?

ashutosh-narkar avatar May 29 '24 17:05 ashutosh-narkar

@ashutosh-narkar, the work in #6040 focused solely on the CPU time aspect, and did not look at how memory usage was affected.

johanfylling avatar May 29 '24 21:05 johanfylling

The data has some objects and arrays, and I wonder if, when they are referenced inside the policy, the interface-to-AST conversions are impacting performance in terms of CPU and memory.
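For context, the conversion being referred to looks roughly like this; a minimal sketch in which a small JSON blob stands in for the ~15MB data.json from the bundle:

package main

import (
	"encoding/json"
	"fmt"

	"github.com/open-policy-agent/opa/ast"
)

func main() {
	// Stand-in for a large data document loaded from a bundle.
	raw := []byte(`{"rules": {"tenants": [{"id": 1}, {"id": 2}]}}`)

	// The in-memory store holds bundle data as plain Go values (interface{}).
	var doc interface{}
	if err := json.Unmarshal(raw, &doc); err != nil {
		panic(err)
	}

	// During evaluation, a referenced value has to be converted to an ast.Value.
	// For a large object this allocates a full AST copy of the data, which is the
	// kind of CPU/memory cost being discussed here when it happens per query.
	value, err := ast.InterfaceToValue(doc)
	if err != nil {
		panic(err)
	}

	fmt.Println(value)
}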

ashutosh-narkar avatar May 29 '24 22:05 ashutosh-narkar

We're looking to implement something like what is discussed in https://github.com/open-policy-agent/opa/issues/4147. This should probably help with performance, as we'll avoid the interface-to-AST conversion during eval.
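As a rough illustration of that direction (converting values to AST once, up front, so eval doesn't have to), OPA's Go API already supports this on the input side via rego.EvalParsedInput; the issue linked above is about doing the equivalent for data in the store. A minimal sketch with a made-up policy:

package main

import (
	"context"
	"fmt"

	"github.com/open-policy-agent/opa/ast"
	"github.com/open-policy-agent/opa/rego"
)

func main() {
	ctx := context.Background()

	pq, err := rego.New(
		rego.Query("data.example.allow"),
		rego.Module("example.rego", `package example
allow := input.user == "alice"`),
	).PrepareForEval(ctx)
	if err != nil {
		panic(err)
	}

	// Convert the input to an ast.Value once, up front.
	parsed, err := ast.InterfaceToValue(map[string]interface{}{"user": "alice"})
	if err != nil {
		panic(err)
	}

	// Passing the pre-parsed value means no interface-to-AST conversion is
	// needed for the input at eval time.
	rs, err := pq.Eval(ctx, rego.EvalParsedInput(parsed))
	if err != nil {
		panic(err)
	}
	fmt.Println(rs[0].Expressions[0].Value) // true
}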

ashutosh-narkar avatar May 31 '24 20:05 ashutosh-narkar

This issue has been automatically marked as inactive because it has not had any activity in the last 30 days. Although currently inactive, the issue could still be considered and actively worked on in the future. More details about the use-case this issue attempts to address, the value provided by completing it or possible solutions to resolve it would help to prioritize the issue.

stale[bot] avatar Jul 06 '24 07:07 stale[bot]

@itayhac are you able to repro this with OPA v0.67.0? I was unable to reproduce it, so it would be good to verify in case I missed something.

ashutosh-narkar avatar Jul 31 '24 01:07 ashutosh-narkar

This issue has been automatically marked as inactive because it has not had any activity in the last 30 days. Although currently inactive, the issue could still be considered and actively worked on in the future. More details about the use-case this issue attempts to address, the value provided by completing it or possible solutions to resolve it would help to prioritize the issue.

stale[bot] avatar Aug 30 '24 01:08 stale[bot]

Closing as there haven't been any more reports of this issue, but please re-open if you run into it again. Thanks!

sspaink avatar Apr 10 '25 12:04 sspaink

I've got a team at work that has been reporting a similar issue for a few months. It's a K8s deployment using OPA as a sidecar. OPA is configured to fetch 4-5 bundles from a bundle server every few minutes. The bundles are all small, on the order of kilobytes. Their container has a memory limit of 512MB and periodically gets OOMKilled.

I have been unable to reproduce locally. They're currently on version 0.70.0 and have turned on the --optimize-store-for-read-speed flag since there was some language about that helping with memory spikes. We're working on some changes so that we can move to 1.x in hopes of some improvements, but that will take a little time.

I'd appreciate any ideas on how to debug this. I am curious whether there is something about the bundle reload process that could be contributing to the issue. I haven't narrowed down whether the OOMKills happen at or just after a bundle fetch, though.

jwineinger avatar Apr 16 '25 16:04 jwineinger

@jwineinger thank you for reporting this! Could you open a new issue so we can look into it more? Adding any logs to the issue would be very helpful for debugging. We will have to confirm whether v1 actually fixes it.

Sorry for pinging everyone by opening/closing it 😅

sspaink avatar Apr 22 '25 13:04 sspaink