[FR]: Flush partially completed traces - flush as they come

Open gustavhartz opened this issue 9 months ago • 10 comments

Proposal summary

It would be great if traces could flush partial elements as they arrive, instead of waiting for the top-level function/trace to finish. If the code raises an exception I don't get any data, which sucks for a run that takes 500s.

Additionally, I don't want to wait 500s to get any data to look at, and want to be able to abort runs early from monitoring this data.

Setting OPIK_DEFAULT_FLUSH_TIMEOUT had no effect.

Motivation

  • What problem are you trying to solve?: loss of failed-run data, and convenience. I don't want to wait for the top-level function/trace to finish
  • How are you currently solving this problem?: monitoring our internal log system
  • What are the benefits of this feature?: Better understanding of functionality even in failed and aborted runs

gustavhartz avatar Mar 21 '25 15:03 gustavhartz

Thanks for reporting @gustavhartz. We will look into the best way to support this. Can you share a bit about your use case? Are these spans coming from a production workflow, or is this offline experimentation?

gidim avatar Mar 21 '25 21:03 gidim

For a long running trace with a lot of sub traces and LLM calls, we cannot see the trace showing in the dashboard until the entire job is completed.

Ideally, we would want to see the trace & data being added as it is logged into the server.

NaxAlpha avatar Mar 22 '25 08:03 NaxAlpha

There are two cases

  • Avoid loss of logs in case of errors
  • Faster monitoring of results

Case one is only really intended for dev/test env as we don't plan on bugs in prod and have opik disabled on prod

Case two is for long-running jobs: it's inefficient to wait 500s plus for the logs to come in, and we also want to be able to abort runs early.

I'm looking into adding Opik to an existing large code base, so we are a little limited on implementation options.

Let me know if you need anything else.

gustavhartz avatar Mar 22 '25 09:03 gustavhartz

Thanks @gustavhartz, I'll take this to our team and see how we can better support this use case.

Nimrod007 avatar Mar 23 '25 15:03 Nimrod007

Hi @gustavhartz, Thanks again for the suggestion!

We've synced internally and identified a technical solution that will allow trace and span visibility upon submission, without needing to wait for task completion.

Since this requires several changes across our data ingestion workflow (both on the SDK and backend), we're planning to start work on it soon. Just to set expectations, this will take some time to complete (likely over the next few weeks).

We'll keep you posted as we make progress.

Nimrod007 avatar Mar 24 '25 15:03 Nimrod007

sounds good 👍

gustavhartz avatar Mar 25 '25 11:03 gustavhartz

Hi @gustavhartz

Thank you for raising this issue! I'm Andrés, a Principal Engineer at Comet, and I wanted to let you know that I’ll be taking this forward. My initial focus will be on creating a proof of concept to address the problem you've outlined.

I'll share updates here as I make progress. Please feel free to provide any additional context that you think would be helpful.

Looking forward to working on this!

Best, Andrés

andrescrz avatar Apr 16 '25 10:04 andrescrz

Sounds good @andrescrz! I think all is captured here

gustavhartz avatar Apr 23 '25 12:04 gustavhartz

Hi @gustavhartz,

The first change at the trace level has been successfully implemented in the backend. The next step will be to propagate these changes to the spans.

Once that is completed, we’ll proceed to schedule the SDK changes.

Best regards, Andrés

andrescrz avatar Apr 30 '25 15:04 andrescrz

Hi @gustavhartz,

Thank you for your patience. I wanted to let you know that support for this feature has been fully implemented in the backend. We are planning to schedule the SDK implementation for the next sprint.

We will keep you updated on our progress.

Best regards, Andrés

andrescrz avatar May 26 '25 12:05 andrescrz

This change is highly anticipated!

stefanadelbert avatar Jun 03 '25 23:06 stefanadelbert

Thanks @stefanadelbert! I'm glad to say that it will be available soon (under a feature flag). The Python SDK part for traces is already in a PR: https://github.com/comet-ml/opik/pull/2333, and an equivalent PR for spans will follow (work already in progress).

andrescrz avatar Jun 04 '25 08:06 andrescrz

Hi everyone,

We’re excited to share that the recent changes to the Python SDK have been merged. These updates are currently behind a feature flag named log_start_trace_span, which is disabled by default for now.

We’ll be releasing and deploying these changes very soon. Following that, our plan is to enable the feature flag by default during the following week.

Thank you for your patience and support. Stay tuned for more updates!

Best regards, Andrés

andrescrz avatar Jun 12 '25 09:06 andrescrz

Hi everyone!

Support for long-running traces and spans has been improved and is now available in the latest release. You can update the Opik SDK to start using it. Please note that this feature is not enabled by default yet, but you can activate it by setting the appropriate environment variable, as shown in the following snippet:

import os

os.environ["OPIK_LOG_START_TRACE_SPAN"] = "True"

Happy coding!

Iaroslav

yaricom avatar Jun 13 '25 08:06 yaricom

Thanks! The Opik release version where this is available is 1.7.34, which is also fully deployed to comet.com.

We'll work to enable the flag by default in the upcoming weeks.

Stay tuned!

andrescrz avatar Jun 13 '25 09:06 andrescrz

I've just tried this feature out. I set the environment variable (export OPIK_LOG_START_TRACE_SPAN=True) and then created an Opik client, created a trace, added some spans and then closed the trace. I was able to see the updates in the Opik backend after each step, which is what this change is all about. Good stuff.

stefanadelbert avatar Jun 21 '25 03:06 stefanadelbert
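
For anyone who wants to repeat the check described in the last comment, here is a minimal sketch of those steps with the Python SDK (1.7.34 or later). It is not taken from the thread: the project name, trace/span names, and payloads are made up for illustration, and setting the flag before creating the client is an assumption made to be safe rather than documented behavior.

```python
import os

# Flag name as given earlier in the thread.
FLAG = "OPIK_LOG_START_TRACE_SPAN"


def enable_early_logging() -> None:
    """Opt in to logging traces and spans when they start, not only when they end."""
    # Assumption: set this before the client is created so the SDK picks it up.
    os.environ[FLAG] = "True"


def run_demo() -> None:
    """Create a trace with one span, step by step, against a configured Opik backend."""
    # Imported lazily so this module still loads where the SDK is not installed.
    from opik import Opik

    client = Opik(project_name="long-running-demo")  # hypothetical project name

    # Step 1: start the trace. With the flag on, it should show up in the UI now.
    trace = client.trace(name="long_job", input={"job_id": 1})

    # Step 2: add a span and finish it; it should appear without waiting for the trace.
    span = trace.span(name="step_1", input={"n": 1})
    span.end(output={"ok": True})

    # Step 3: close the trace and push anything still buffered.
    trace.end(output={"status": "done"})
    client.flush()
```

Call `enable_early_logging()` first and then `run_demo()`, refreshing the Opik UI between steps (for example by pausing in a debugger) to confirm that each update lands before the trace is closed.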