flagd icon indicating copy to clipboard operation
flagd copied to clipboard

[BUG] Spans from flag evaluation not being created in Java

Open julianocosta89 opened this issue 1 year ago • 13 comments

Observed behavior

When updating the dependencies for the AdService in the OTel demo, I've noticed that the flagd spans are not being created in Java. We do have the flagd spans connected with RecommendationService (Python), CartService (.NET) and Frontend-proxy (Envoy).

But for AdService no. Even though, we follow the recommended approach documented here: https://flagd.dev/providers/java/#opentelemetry-tracing-rpc-only

https://github.com/open-telemetry/opentelemetry-demo/blob/main/src/adservice/src/main/java/oteldemo/AdService.java#L86-L94

Expected Behavior

Whenever AdService checks the state of the FeatureFlag, I'd expect 3 spans from flagd to be created, as it is happening with other services.

Steps to reproduce

git clone [email protected]:open-telemetry/opentelemetry-demo.git
cd opentelemetry-demo
docker compose up -d

Navigate to Jaeger UI: http://localhost:8080/jaeger/ui/

Check the received traces from AdService and Flagd.

julianocosta89 avatar Jul 24 '24 13:07 julianocosta89

@julianocosta89 I had a look into this. The service use withGlobalTelemetry(true) which internally obtains an OpenTelemetry instance from GlobalOpenTelemetry.get(). This follows Java manual instrumentation guidelines [1]. And in the AdService, there is no manual configuration initializations and this is the root cause for this observation.

[1] - https://opentelemetry.io/docs/languages/java/instrumentation/#manual-configuration

Kavindu-Dodan avatar Jul 24 '24 21:07 Kavindu-Dodan

@Kavindu-Dodan we are using the OTel Java agent, isn't that done automatically? We have another service, written in Kotlin, and for that it is working fine. https://github.com/open-telemetry/opentelemetry-demo/blob/main/src/frauddetectionservice/src/main/kotlin/frauddetectionservice/main.kt#L32-L36

julianocosta89 avatar Jul 25 '24 08:07 julianocosta89

I've done some tests again and it seems that it is actually creating spans, but for some reason just once. Here is the trace containing the flagd spans: Screenshot 2024-07-25 at 12 21 23

Here are the metrics from one of the flags from AdService: Screenshot 2024-07-25 at 12 20 43

As we can see, the flag is being queried every time AdService is called, but the spans from flagd are just arriving once. Maybe we have some error in the implementation of AdService.

julianocosta89 avatar Jul 25 '24 10:07 julianocosta89

Hey @julianocosta89, this is an optimization in the Java provider that can disabled. Basically, a value can be cached if it's "static" and the provider is able to establish a persistent connection with flagd to bust the cache when the flag configuration changes.

Here are the configuration options that are available. https://github.com/open-feature/java-sdk-contrib/tree/main/providers/flagd#configuration-options

beeme1mr avatar Jul 25 '24 14:07 beeme1mr

Hey @julianocosta89, this is an optimization in the Java provider that can disabled. Basically, a value can be cached if it's "static" and the provider is able to establish a persistent connection with flagd to bust the cache when the flag configuration changes.

Here are the configuration options that are available. https://github.com/open-feature/java-sdk-contrib/tree/main/providers/flagd#configuration-options

Oh, cool TIL.

Any reason the behaviour is different for Java and Kotlin? Both are using the same dependencies and none have the cache disabled.

julianocosta89 avatar Jul 25 '24 20:07 julianocosta89

There are a few reasons why a value wouldn't be cashed. For example, if there's a targeting rule in the flag configuration. I'll take a look tomorrow.

beeme1mr avatar Jul 25 '24 21:07 beeme1mr

Hey @julianocosta89, can you confirm that the kafkaQueueProblems doesn't have a targeting rule defined in your environment? That would immediately make it ineligible to be cached. I didn't notice any obvious reasons this isn't being cached, but I didn't have a chance to investigate in depth.

beeme1mr avatar Jul 26 '24 19:07 beeme1mr

this is what I currently have for kafkaQueueProblems:

"kafkaQueueProblems": {
  "description": "Overloads Kafka queue while simultaneously introducing a consumer side delay leading to a lag spike",
  "state": "ENABLED",
  "variants": {
    "on": 100,
    "off": 0
  },
  "defaultVariant": "off"
},

And this is the kotlin snippet:

fun main() {
    val options = FlagdOptions.builder()
    .withGlobalTelemetry(true)
    .build()
    val flagdProvider = FlagdProvider(options)
    OpenFeatureAPI.getInstance().setProvider(flagdProvider)

    val props = Properties()
    props[KEY_DESERIALIZER_CLASS_CONFIG] = StringDeserializer::class.java.name
    props[VALUE_DESERIALIZER_CLASS_CONFIG] = ByteArrayDeserializer::class.java.name
    props[GROUP_ID_CONFIG] = groupID
    val bootstrapServers = System.getenv("KAFKA_SERVICE_ADDR")
    if (bootstrapServers == null) {
        println("KAFKA_SERVICE_ADDR is not supplied")
        exitProcess(1)
    }
    props[BOOTSTRAP_SERVERS_CONFIG] = bootstrapServers
    val consumer = KafkaConsumer<String, ByteArray>(props).apply {
        subscribe(listOf(topic))
    }

    var totalCount = 0L
    consumer.use {
        while (true) {
            totalCount = consumer
                .poll(ofMillis(100))
                .fold(totalCount) { accumulator, record ->
                    val newCount = accumulator + 1
                    if (getFeatureFlagValue("kafkaQueueProblems") > 0) {
                        logger.info("FeatureFlag 'kafkaQueueProblems' is enabled, sleeping 1 second")
                        Thread.sleep(1000)
                    }
                    val orders = OrderResult.parseFrom(record.value())
                    logger.info("Consumed record with orderId: ${orders.orderId}, and updated total count to: $newCount")
                    newCount
                }
        }
    }
}

/**
* Retrieves the status of a feature flag from the Feature Flag service.
*
* @param ff The name of the feature flag to retrieve.
* @return `true` if the feature flag is enabled, `false` otherwise or in case of errors.
*/
fun getFeatureFlagValue(ff: String): Int {
    val client = OpenFeatureAPI.getInstance().client
    // TODO: Plumb the actual session ID from the frontend via baggage?
    val uuid = UUID.randomUUID()

    val clientAttrs = mutableMapOf<String, Value>()
    clientAttrs["session"] = Value(uuid.toString())
    client.evaluationContext = ImmutableContext(clientAttrs)
    val intValue = client.getIntegerValue(ff, 0)
    return intValue
}

julianocosta89 avatar Jul 29 '24 06:07 julianocosta89

@julianocosta89 could it be that you are evaluating the same flag in quick succession, and evaluations are interleaving? The flag value can't be cached until the RPC completes, so if 2 evaluations happen quickly, they will both cause RPCs since theirs no previously resolved value in the cache (yet)..

Besides some transient connection issue, I can't see another reason. If the reason field is coming back as STATIC, subsequent evaluations should result in cache hits with your configuration :thinking:

EDIT:

Oh, one more reason could be that the provider hasn't fully connected its stream before which no caching happens... this is plausible since you're using setProvider (which doesn't block awaiting a connection) instead of setProviderAndWait which blocks until the provider is fully initialized.

toddbaert avatar Aug 13 '24 17:08 toddbaert

I am doing a little bit of spring issue cleaning, @julianocosta89 is this issue still present?

aepfli avatar Mar 13 '25 09:03 aepfli

Yes, the 3 feature flags we have for Ad are being queried constantly, as we can see on the evaluation metric:

Image

But on Traces, this happens only once:

Image

All other Traces do not have the flagd spans.

One thing that I also noticed while double checking was that I'm getting a couple of traces with 10min duration,and 31 span events 🤯

Maybe those 2 cases are interconnected:

Image

julianocosta89 avatar Mar 13 '25 11:03 julianocosta89

Hey @julianocosta89, the 10-minute traces are for the event stream to watch for flag change events. This is expected behavior but we may want to adjust how it's monitored in the OTel demo. By the way, will you be at Contribfest? A few OpenFeature maintainers were planning on attending and it may be a good opportunity to work on some improvements.

beeme1mr avatar Mar 13 '25 12:03 beeme1mr

Hey @julianocosta89, the 10-minute traces are for the event stream to watch for flag change events. This is expected behavior but we may want to adjust how it's monitored in the OTel demo. By the way, will you be at Contribfest? A few OpenFeature maintainers were planning on attending and it may be a good opportunity to work on some improvements.

Sounds great, I'll be there and I'm planning on spending a lot of time on the Observatory as well! Hope to meet you there 😊

julianocosta89 avatar Mar 13 '25 12:03 julianocosta89