Improve benchmark integration with Langfuse/LangSmith
**Is your feature request related to a problem? Please describe.**
Related to #455. The currently implemented integration with Langfuse and LangSmith could be improved.
**Describe the solution you'd like**
Proposed improvements (a sketch of how they might be attached to traces follows the list):

**Tracking platform (CPU/GPU, etc.) and the compute resource currently used by the model**
- Purpose: to reliably compare latency between models
- Effort: low

**Adding task ID**
- Purpose: sometimes I wanted to check the trace for a specific task and it was hard to find; it can also be useful for comparing models' performance on a specific task
- Effort: low

**Tracking commit hash**
- Purpose: to compare different versions of RAI, check performance after code changes, and make sure we are comparing something reliable
- Effort: low

**Tracking session ID**
- Purpose: to easily filter a single benchmark run out of all runs
- Effort: low

**Introduce error codes**
- Purpose: to easily filter errors and pass them to a fine-tuning workflow
- Effort: low
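A minimal sketch of what this could look like, assuming the Langfuse v2 Python SDK (`Langfuse()`, `langfuse.trace(...)`). The `ErrorCode` values, the `task_id` value, and the trace name are hypothetical placeholders for illustration, not existing RAI identifiers:

```python
import platform
import subprocess
import uuid
from enum import Enum

from langfuse import Langfuse


class ErrorCode(str, Enum):
    """Hypothetical error categories for filtering traces and feeding fine-tuning."""
    TOOL_CALL_FAILED = "RAI_ERROR_TOOL_CALL"
    INVALID_OUTPUT = "RAI_ERROR_INVALID_OUTPUT"
    TIMEOUT = "RAI_ERROR_TIMEOUT"


def git_commit_hash() -> str:
    # Commit hash of the currently checked-out RAI revision.
    return subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()


def run_metadata() -> dict:
    # Platform / compute-resource info; could be extended with GPU
    # detection (e.g. torch.cuda or nvidia-smi) when a GPU is used.
    return {
        "os": platform.system(),
        "cpu": platform.processor() or platform.machine(),
        "commit": git_commit_hash(),
    }


langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment

session_id = f"benchmark-{uuid.uuid4()}"  # one ID per benchmark run

# One trace per task: session_id groups all tasks of this run, while
# metadata and tags make platform, commit hash, and task ID searchable.
trace = langfuse.trace(
    name="tool-calling-benchmark",
    session_id=session_id,
    metadata={**run_metadata(), "task_id": "navigate_to_point"},
    tags=["benchmark", git_commit_hash()],
)

# On failure, record the error category so traces can be filtered by code.
trace.update(metadata={"error_code": ErrorCode.TOOL_CALL_FAILED.value})
```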
**Additional context**
Errors included in the Comment column could carry error category IDs (screenshot from Langfuse).
The Session column in Langfuse could be used; in LangSmith a different mechanism would be needed, since I didn't notice an explicit session ID there (see the sketch below).
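On the LangSmith side, one possible workaround, assuming the benchmark drives a LangChain runnable (`benchmark_agent` is a placeholder name), is to pass the same session ID as run metadata and tags through the standard `config` argument; both are filterable in the LangSmith UI:

```python
# Reuses session_id and git_commit_hash() from the sketch above.
result = benchmark_agent.invoke(
    {"task": "navigate_to_point"},
    config={
        "metadata": {"session_id": session_id, "commit": git_commit_hash()},
        "tags": ["benchmark", session_id],
    },
)
```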
@MagdalenaKotynia what's the current timeline for this task? Is anyone working on it?
@maciejmajek I haven't started working on this task. This enhancement proposal was created as a future improvement to be tackled after the higher-priority work on the tool calling benchmark is done. I suggest starting on it after the refactor of rai_bench in #517 is finished and merged. I edited the issue and added some other proposed improvements related to tracing.
I'm removing this issue from RAI 2.0 due to time constraints and the low priority of the task.
Applied here: https://github.com/RobotecAI/rai/pull/606