test: Add gpqa tests for DeepSeek models
- Add gpqa accuracy test script
- Add gpqa accuracy tests
- Update DeepSeek-v3 doc
- Update qa test list
/bot run
PR_Github #427 [ run ] triggered by Bot
@LarryXFly for vis, since this can introduce new test cases for the QA pipeline.
June
@lfr-0531 BTW, I notice that the GPQA test will be run at least twice. How long does it take to finish a single round of GPQA evaluation in this setting?
June
Given the running time, we currently run it only once. With MTP enabled, the test should finish in 20~30 minutes; without MTP, it may take around 1 hour.
PR_Github #427 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #366 completed with status: 'SUCCESS'
cc @syuoni for vis, let's consider supporting gpqa task in the ongoing accuracy suite.
As more accuracy evaluation tasks are added to TRT-LLM, it seems increasingly necessary to unify them under a single entrypoint. cc @byshiue for vis.
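To make the single-entrypoint idea concrete, here is a minimal sketch of what a task registry could look like. Everything in it is hypothetical and for illustration only: `TASKS`, `register_task`, `run_task`, and `score_multiple_choice` are not existing TRT-LLM APIs, and the scorer is a plain case-insensitive exact match on answer letters rather than the actual GPQA evaluation logic.

```python
from typing import Callable, Dict, Sequence

# Hypothetical evaluator signature: (predictions, references) -> accuracy in [0, 1].
Evaluator = Callable[[Sequence[str], Sequence[str]], float]

# Hypothetical registry mapping task names to evaluators.
TASKS: Dict[str, Evaluator] = {}


def register_task(name: str) -> Callable[[Evaluator], Evaluator]:
    """Register an evaluator under a task name (illustrative helper)."""

    def decorator(fn: Evaluator) -> Evaluator:
        if name in TASKS:
            raise ValueError(f"task {name!r} is already registered")
        TASKS[name] = fn
        return fn

    return decorator


@register_task("gpqa")
def score_multiple_choice(predictions: Sequence[str],
                          references: Sequence[str]) -> float:
    """Case-insensitive exact match on answer letters (assumed scoring rule)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p.strip().upper() == r.strip().upper()
                  for p, r in zip(predictions, references))
    return correct / len(references)


def run_task(name: str, predictions: Sequence[str],
             references: Sequence[str]) -> float:
    """Single entrypoint: look up the task by name and run its evaluator."""
    if name not in TASKS:
        raise KeyError(f"unknown task {name!r}; available: {sorted(TASKS)}")
    return TASKS[name](predictions, references)
```

With this shape, adding a new accuracy task would only mean registering one more evaluator, and the test lists could call `run_task("gpqa", ...)` instead of a per-task script.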
/bot run
PR_Github #602 [ run ] triggered by Bot
PR_Github #602 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #511 completed with status: 'SUCCESS'
/bot run
PR_Github #641 [ run ] triggered by Bot
/bot kill
/bot run
PR_Github #644 [ run ] triggered by Bot
PR_Github #641 [ run ] completed with state ABORTED
PR_Github #644 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #543 completed with status: 'SUCCESS'
/bot run
PR_Github #650 [ run ] triggered by Bot
PR_Github #650 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #549 completed with status: 'SUCCESS'
/bot reuse-pipeline
PR_Github #652 [ reuse-pipeline ] triggered by Bot
PR_Github #652 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #650 for commit 6d15d8b