[FR]: Flush partially completed traces - flush as they come
Proposal summary
It would be great if partial trace elements could be flushed as they arrive, instead of waiting for the top-level function/trace to finish. If the code raises an exception, I don't get any data, which sucks for a run that takes 500s to execute.
Additionally, I don't want to wait 500s before there is any data to look at, and I want to be able to abort runs early based on monitoring this data.
Setting OPIK_DEFAULT_FLUSH_TIMEOUT did not have any effect.
Motivation
- What problem are you trying to solve?: loss of failed run data & convenience. Don't want to wait for the top-level function/trace to finish
- How are you currently solving this problem?: monitoring our internal log system
- What are the benefits of this feature?: Better understanding of functionality even in failed and aborted runs
Thanks for reporting, @gustavhartz. We will look into the best way to support this. Can you share a bit about your use case? Are these spans coming from a production workflow, or is this offline experimentation?
For a long running trace with a lot of sub traces and LLM calls, we cannot see the trace showing in the dashboard until the entire job is completed.
Ideally, we would want to see the trace & data being added as it is logged into the server.
There are two cases
- Avoid loss of logs in case of errors
- Faster monitoring of results
Case one is only really intended for dev/test environments, as we don't plan on bugs in prod and have Opik disabled in prod.
Case two is for long-running jobs: it's inefficient to wait 500s or more for the logs to come in. It also matters when we want to abort runs.
I'm looking into integrating this into an existing large code base, so we are a little limited on implementation options.
Let me know if you need anything else.
Thanks @gustavhartz, I'll take this up with our team and see how we can better support this use case.
Hi @gustavhartz, Thanks again for the suggestion!
We've synced internally and identified a technical solution that will allow trace and span visibility upon submission, without needing to wait for task completion.
Since this requires several changes across our data ingestion workflow (both in the SDK and the backend), we're planning to start work on it soon. Just to set expectations, this will take some time to complete (likely over the next few weeks).
We'll keep you posted as we make progress.
sounds good 👍
Hi @gustavhartz
Thank you for raising this issue! I'm Andrés, a Principal Engineer at Comet, and I wanted to let you know that I’ll be taking this forward. My initial focus will be on creating a proof of concept to address the problem you've outlined.
I'll share updates here as I make progress. Please feel free to provide any additional context that you think would be helpful.
Looking forward to working on this!
Best, Andrés
Sounds good @andrescrz! I think all is captured here
Hi @gustavhartz,
The first change at the trace level has been successfully implemented in the backend. The next step will be to propagate these changes to the spans.
Once that is completed, we’ll proceed to schedule the SDK changes.
Best regards, Andrés
Hi @gustavhartz,
Thank you for your patience. I wanted to let you know that support for this feature has been fully implemented in the backend. We are planning to schedule the SDK implementation for the next sprint.
We will keep you updated on our progress.
Best regards, Andrés
This change is highly anticipated!
Thanks @stefanadelbert! I'm glad to say that it will be available soon (under a feature flag). The Python SDK part for traces is already up as a PR: https://github.com/comet-ml/opik/pull/2333. A similar PR for spans will follow (work already in progress).
Hi everyone,
We’re excited to share that the recent changes to the Python SDK have been merged. These updates are currently behind a feature flag named log_start_trace_span, which is disabled by default for now.
We’ll be releasing and deploying these changes very soon. Following that, our plan is to enable the feature flag by default during the following week.
Thank you for your patience and support. Stay tuned for more updates!
Best regards, Andrés
Hi everyone!
Support for long-running traces and spans has been improved and is now available in the latest release. You can update the Opik SDK to start using it. Please note that this feature is not enabled by default yet, but you can activate it by setting the OPIK_LOG_START_TRACE_SPAN environment variable, as shown in the following snippet:
import os
os.environ["OPIK_LOG_START_TRACE_SPAN"] = "True"
Happy coding!
Iaroslav
Thanks! The Opik release version where this is available is 1.7.34, which is also fully deployed to comet.com.
We'll work to enable the flag by default in the upcoming weeks.
Stay tuned!
I've just tried this feature out. I set the environment variable (export OPIK_LOG_START_TRACE_SPAN=True) and then created an Opik client, created a trace, added some spans and then closed the trace. I was able to see the updates in the Opik backend after each step, which is what this change is all about. Good stuff.
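For readers new to the thread, here is a toy model of the behaviour difference discussed above. It is only a sketch and is not Opik's implementation: InMemoryBackend, traced and failing_job are hypothetical names made up for illustration, and the in-memory dict merely stands in for a real backend. It shows why a record written when the trace starts survives an exception, while a record written only on completion is lost.

```python
from contextlib import contextmanager


class InMemoryBackend:
    """Hypothetical stand-in for a tracing backend; keeps records in a dict."""

    def __init__(self):
        self.records = {}

    def upsert(self, name, **fields):
        # Create the record if it is missing, then merge in the new fields.
        self.records.setdefault(name, {}).update(fields)


@contextmanager
def traced(backend, name, log_on_start):
    """Record a trace, optionally making it visible as soon as it starts."""
    if log_on_start:
        backend.upsert(name, status="running")
    yield
    # Only reached if the traced block did not raise.
    backend.upsert(name, status="finished")


def failing_job(backend, log_on_start):
    """Run a job that crashes and return whatever the backend recorded."""
    try:
        with traced(backend, "long-job", log_on_start):
            raise RuntimeError("simulated failure mid-run")
    except RuntimeError:
        pass
    return backend.records


print(failing_job(InMemoryBackend(), log_on_start=False))
print(failing_job(InMemoryBackend(), log_on_start=True))
```

With log_on_start=False the crashed job leaves nothing behind, which mirrors the data loss described in the original report. With log_on_start=True the record is already present (still marked as running) when the exception is raised, so it can be inspected while the job runs or after it fails.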