Improve benchmark integration with Langfuse/LangSmith
**Is your feature request related to a problem? Please describe.**
Related to #455. The currently implemented integration with Langfuse and LangSmith could be improved.
**Describe the solution you'd like**
Proposed improvements (a sketch of how they might be attached to traces follows the list):

**Tracking platform (CPU/GPU, etc.) and the compute resource currently used by the model**
- Purpose: to reliably compare latency between models
- Effort: low

**Adding task ID**
- Purpose: sometimes I wanted to check the trace for a specific task and it was hard to find; it can also be useful for comparing models' performance on a specific task
- Effort: low

**Tracking commit hash**
- Purpose: to compare different versions of RAI, check performance after code changes, and make sure we are comparing something reliable
- Effort: low

**Tracking session ID**
- Purpose: to easily filter a single benchmark run out of all runs
- Effort: low

**Introduce error codes**
- Purpose: to easily filter errors and pass them to a fine-tuning workflow
- Effort: low
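A minimal sketch of what this could look like, assuming the Langfuse v2 Python SDK (`Langfuse()`, `langfuse.trace(...)`). The `ErrorCode` values, the `task_id` value, and the trace name are hypothetical placeholders for illustration, not existing RAI identifiers:

```python
import platform
import subprocess
import uuid
from enum import Enum

from langfuse import Langfuse


class ErrorCode(str, Enum):
    """Hypothetical error categories for filtering traces and feeding fine-tuning."""
    TOOL_CALL_FAILED = "RAI_ERROR_TOOL_CALL"
    INVALID_OUTPUT = "RAI_ERROR_INVALID_OUTPUT"
    TIMEOUT = "RAI_ERROR_TIMEOUT"


def git_commit_hash() -> str:
    # Commit hash of the currently checked-out RAI revision.
    return subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()


def run_metadata() -> dict:
    # Platform / compute-resource info; could be extended with GPU
    # detection (e.g. torch.cuda or nvidia-smi) when a GPU is used.
    return {
        "os": platform.system(),
        "cpu": platform.processor() or platform.machine(),
        "commit": git_commit_hash(),
    }


langfuse = Langfuse()  # reads LANGFUSE_* credentials from the environment

session_id = f"benchmark-{uuid.uuid4()}"  # one ID per benchmark run

# One trace per task: session_id groups all tasks of this run, while
# metadata and tags make platform, commit hash, and task ID searchable.
trace = langfuse.trace(
    name="tool-calling-benchmark",
    session_id=session_id,
    metadata={**run_metadata(), "task_id": "navigate_to_point"},
    tags=["benchmark", git_commit_hash()],
)

# On failure, record the error category so traces can be filtered by code.
trace.update(metadata={"error_code": ErrorCode.TOOL_CALL_FAILED.value})
```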
**Additional context**
Errors included in the Comment column could carry error category IDs (screenshot from Langfuse).
The Session column in Langfuse could be used; in LangSmith a different mechanism would be needed, since I didn't notice an explicit session ID there (see the sketch below).
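On the LangSmith side, one possible workaround, assuming the benchmark drives a LangChain runnable (`benchmark_agent` is a placeholder name), is to pass the same session ID as run metadata and tags through the standard `config` argument; both are filterable in the LangSmith UI:

```python
# Reuses session_id and git_commit_hash() from the sketch above.
result = benchmark_agent.invoke(
    {"task": "navigate_to_point"},
    config={
        "metadata": {"session_id": session_id, "commit": git_commit_hash()},
        "tags": ["benchmark", session_id],
    },
)
```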
@MagdalenaKotynia what's the current timeline for this task? Is anyone working on it?
@maciejmajek I haven't started working on this task. This enhancement proposal was created as a future improvement to be tackled after the higher-priority work on the tool calling benchmark is done. I suggest starting on it after the refactor of rai_bench in #517 is finished and merged. I edited the issue and added some other proposed improvements related to tracing.
I'm removing this issue from RAI 2.0 due to time constraints and the low priority of the task.
Applied here: https://github.com/RobotecAI/rai/pull/606