[Bug]: Opik Optimizer runs still show "running"
What component(s) are affected?
- [ ] Opik Python SDK
- [ ] Opik Typescript SDK
- [x] Opik Agent Optimizer SDK
- [x] Opik UI
- [x] Opik Server
- [ ] Documentation
Opik version
- Opik version: current main branch
If we close a running Opik Optimizer job before it is "completed", it remains marked as running in the DB/UI. We should auto-clean/expire a job that is stuck in this state for a few hours, or at most 24 hours (1 day).
Describe the problem
The UI is cluttered with jobs shown as in progress even though they have already terminated.
Reproduction steps and code snippets
Close a running Opik Optimizer job midway, then return to the UI; the job still shows "In Progress".
Error logs or stack trace
No response
Healthcheck results
No response
fyi @dsblank @jverre
@vincentkoc, can you please assign this issue to me? I want to work on it.
Hi @vincentkoc @dsblank @YarivHashaiComet @jverre,
I am able to reproduce this issue on my local setup. Whenever an optimization process terminates abruptly (e.g., crash, kill, OS exit), the optimization record remains stuck in `status = 'running'`, and `last_updated_at` in the `optimizations` table does not refresh, because the backend never receives a final status update.
I would like to work on this. Could you please assign this issue to me?
Here is my proposed approach to fix this issue:
1. **Add a backend-side heartbeat mechanism**
   a) Introduce a lightweight heartbeat API call that the SDK periodically triggers during optimization runs.
   b) Each heartbeat updates only the `last_updated_at` column in the `optimizations` table for the same optimization ID.
   This ensures the UI and ClickHouse reflect that the optimizer is still alive.
2. **Add "stuck optimization" auto-cleanup**
   Implement a backend watchdog (scheduled task or periodic coroutine) that:
   a) scans for optimizations in `status = 'running'`;
   b) checks whether `last_updated_at` is older than a threshold (e.g., 30 minutes to 24 hours);
   c) automatically marks them as cancelled (or failed).
   This resolves zombie optimizations that were never cleanly closed.
3. **Modify the server-side update logic**
   Ensure the server updates `last_updated_at` whenever:
   a) a trial experiment belonging to that optimization is created;
   b) an optimization update request arrives.
   This allows ongoing activity (e.g., experiments being logged) to keep the optimization fresh.
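To make the intent concrete, here is a minimal sketch of steps 1 and 2. Note this is illustrative only: `Optimization`, `heartbeat`, `expire_stale_optimizations`, and `STALE_THRESHOLD` are hypothetical names, not existing Opik APIs, and the real implementation would run as a scheduled backend task against the database rather than over in-memory objects:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical threshold; in practice this would be configurable
# (somewhere between 30 minutes and 24 hours, per the issue).
STALE_THRESHOLD = timedelta(hours=24)

@dataclass
class Optimization:
    """Stand-in for a row in the `optimizations` table."""
    id: str
    status: str  # e.g. "running", "completed", "cancelled"
    last_updated_at: datetime

def heartbeat(opt: Optimization, now: datetime) -> None:
    """Step 1 (SDK side): refresh only last_updated_at to signal liveness."""
    opt.last_updated_at = now

def expire_stale_optimizations(
    optimizations: list[Optimization],
    now: datetime,
    threshold: timedelta = STALE_THRESHOLD,
) -> list[str]:
    """Step 2 (backend watchdog): mark 'running' optimizations whose
    last_updated_at is older than the threshold as 'cancelled'.
    Returns the ids of the expired records."""
    expired = []
    for opt in optimizations:
        if opt.status == "running" and now - opt.last_updated_at > threshold:
            opt.status = "cancelled"
            expired.append(opt.id)
    return expired
```

With this shape, a run that keeps sending heartbeats never trips the threshold, while a run killed mid-way stops refreshing `last_updated_at` and gets swept on the watchdog's next pass.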
Could you please let me know whether this is the right approach, or suggest an alternative if you have a different one in mind? I would like to work on this issue!
Thank you!
Hi @sarangtagad-git! 👋
Thank you so much for the detailed analysis and proposed solution! It's great to see that you've already reproduced the issue locally and identified the root cause.
Regarding your proposed approach, your 3-part solution is well thought out! Here's my feedback:
- **Heartbeat mechanism** ✅ Good idea. This would work well for long-running optimizations. Consider making the interval configurable.
- **Auto-cleanup watchdog** ✅ Recommended. The 24-hour threshold mentioned in the original issue seems reasonable. This could be implemented as a scheduled task on the backend.
- **Server-side update logic** ⚠️ Nice to have. Clever idea, but some optimizers might have long gaps between experiment creation. This could be a nice-to-have but probably shouldn't be the primary mechanism.
I'm happy to assign this to you! Feel free to explore the codebase (backend service layer, optimizer SDK) and open a draft PR early so we can provide feedback along the way. 🚀 Looking forward to your contribution!
Hi @source-rashi! 👋
Thanks for your interest in contributing to Opik! I'm happy to assign this to you as well.
I see @sarangtagad-git has provided a detailed technical proposal above that outlines a potential approach to solving this issue. Feel free to work on this issue as well!
Looking forward to your contribution! 🚀
Hi @andrescrz,
Thank you for your response. I am working on this!