superbenchmark
sb should return a non-zero exit code when executor.py fails
What's the issue, what's expected?: This is with the v0.6.0 release. The gemm-flops benchmark is run on a platform where the GPU is probably not supported (Tesla K80). SuperBench logs an internal error such as "Executor failed in gemm-flops, invalid context.", but sb still returns exit code 0 at the end.
Expected: sb should return a non-zero exit code for this type of error.
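To make the expected behavior concrete, here is a minimal sketch of how per-rank benchmark return codes could be folded into one process exit code. This is not SuperBench's actual code: `exit_code_from` is a hypothetical helper, and the value 34 is simply the return code that appears in the log further down in this report.

```python
from typing import Iterable


def exit_code_from(return_codes: Iterable[int]) -> int:
    """Return 0 only if every code is 0, otherwise the first non-zero code.

    Hypothetical helper for illustration; not part of SuperBench.
    """
    for code in return_codes:
        if code != 0:
            return code
    return 0


# Example: rank 0 succeeded, rank 1 returned 34 (unsupported architecture).
overall = exit_code_from([0, 34])
print(overall)
```

A CLI entry point could then end with `sys.exit(overall)` so that callers such as CI scripts see the failure.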
How to reproduce it?: On a VM with a Tesla K80 GPU (or CPU), run the gemm-flops benchmark.
Log message or snapshot?:
[2022-09-09 20:38:20,578 N000000:38919][runner.py:392][INFO] Runner is going to run gemm-flops in local mode, proc rank 1.
[2022-09-09 20:38:20,580 N000000:38919][ansible.py:107][INFO] Run docker exec --env-file /tmp/sb.env sb-workspace bash -c 'PROC_RANK=1 CUDA_VISIBLE_DEVICES=1 timeout 1200 sb exec --output-dir outputs/2022-09-09_20-38-14 -c sb.config.yaml -C superbench.enable=gemm-flops' on remote ...
[2022-09-09 20:38:20,580 N000000:38919][ansible.py:72][INFO] Run as sudo ...
localhost | CHANGED | rc=0 >>
[2022-09-09 20:38:22,577 N000000:246][executor.py:235][INFO] Executor is going to execute gemm-flops.
[2022-09-09 20:38:23,363 N000000:246][registry.py:255][WARNING] Benchmark has no implementation, name: gemm-flops, platform: CPU
[2022-09-09 20:38:23,364 N000000:246][executor.py:132][ERROR] Executor failed in gemm-flops, invalid context.
localhost | CHANGED | rc=0 >>
[2022-09-09 20:38:22,702 N000000:260][executor.py:235][INFO] Executor is going to execute gemm-flops.
[2022-09-09 20:38:23,479 N000000:260][registry.py:255][WARNING] Benchmark has no implementation, name: gemm-flops, platform: CPU
[2022-09-09 20:38:23,479 N000000:260][executor.py:132][ERROR] Executor failed in gemm-flops, invalid context.
[2022-09-09 20:38:23,731 N000000:38918][ansible.py:78][INFO] Run succeed, return code 0.
[2022-09-09 20:38:23,860 N000000:38919][ansible.py:78][INFO] Run succeed, return code 0.
[2022-09-09 20:38:23,862 N000000:38433][ansible.py:125][INFO] Run playbook fetch_results.yaml ...
PLAY [Fetch Results] ***********************************************************
TASK [Gathering Facts] *********************************************************
ok: [localhost]
TASK [Synchronize Output Directory] ********************************************
changed: [localhost]
PLAY RECAP *********************************************************************
localhost : ok=2 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
[2022-09-09 20:38:26,514 N000000:38433][ansible.py:78][INFO] Run succeed, return code 0.
[2022-09-09 20:38:26,516 N000000:38433][runner.py:256][ERROR] Invalid content in JSON file: /home/aiscadmin/superbench/outputs/2022-09-09_20-38-14/nodes/N000000/benchmarks/gemm-flops/rank0/results.json
[2022-09-09 20:38:26,516 N000000:38433][runner.py:256][ERROR] Invalid content in JSON file: /home/aiscadmin/superbench/outputs/2022-09-09_20-38-14/nodes/N000000/benchmarks/gemm-flops/rank1/results.json
2022-09-09 20:38:26.746808: Command exit code: 0
Finished all. errors=0, runtime=13.1 s
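Until sb reports this itself, a caller could check the fetched results files after the run. The sketch below is a workaround under stated assumptions, not SuperBench code: `find_bad_results` is a hypothetical helper, and it treats a `results.json` that cannot be read, cannot be parsed, or parses to an empty value as invalid. What runner.py itself considers "invalid content" is not shown in the log, so this may be stricter or looser than the real check.

```python
import json
from pathlib import Path
from typing import List


def find_bad_results(output_dir: str) -> List[str]:
    """List results.json files under output_dir that look invalid.

    Assumption: unreadable, unparseable, or empty JSON counts as invalid.
    """
    bad = []
    for path in sorted(Path(output_dir).rglob("results.json")):
        try:
            data = json.loads(path.read_text())
        except (OSError, json.JSONDecodeError):
            bad.append(str(path))
            continue
        if not data:
            bad.append(str(path))
    return bad
```

A wrapper script could call `find_bad_results("outputs/2022-09-09_20-38-14")` after sb finishes and exit with a non-zero code when the returned list is not empty.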
Additional information:
For the Tesla K80 GPU, after I manually created nvidia-uvm, the NVIDIA GPU was detected and used. executor.py then failed because the GPU type is not supported. Again, the 'sb' script did not fail, although it is expected to.
[2022-09-12 17:48:10,690 N000000:262][executor.py:235][INFO] Executor is going to execute gemm-flops.
[2022-09-12 17:48:11,541 N000000:262][cuda_gemm_flops_performance.py:75][ERROR] Unsupported architecture - benchmark: gemm-flops, compute capability: 3.7, supports 7.0 7.5 8.0 8.6
[2022-09-12 17:48:11,542 N000000:262][executor.py:120][INFO] benchmark: gemm-flops, return code: 34, result: {'return_code': [34]}.
[2022-09-12 17:48:11,542 N000000:262][executor.py:127][ERROR] Executor failed in gemm-flops.
[2022-09-12 17:48:11,943 N000000:112306][ansible.py:78][INFO] Run succeed, return code 0.
[2022-09-12 17:48:11,988 N000000:112307][ansible.py:78][INFO] Run succeed, return code 0.
[2022-09-12 17:48:11,990 N000000:111757][ansible.py:125][INFO] Run playbook fetch_results.yaml ...
PLAY [Fetch Results] ***********************************************************
TASK [Gathering Facts] *********************************************************
by setting deprecation_warnings=False in ansible.cfg.
ok: [localhost]
TASK [Synchronize Output Directory] ********************************************
changed: [localhost]
PLAY RECAP *********************************************************************
localhost : ok=2 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
[2022-09-12 17:48:14,851 N000000:111757][ansible.py:78][INFO] Run succeed, return code 0.
2022-09-12 17:48:15.170990: Command exit code: 0
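As a stopgap, a caller could scan the captured sb output for `[ERROR]` records and fail on its own. The sketch below relies only on the `[file.py:line][LEVEL]` record layout visible in the logs above. `ERROR_RECORD` and `error_sources` are hypothetical names, and the pattern is an assumption drawn from these samples rather than a documented log format.

```python
import re
from typing import List

# Matches the "[file.py:line][ERROR]" part of a log record.
ERROR_RECORD = re.compile(r"\[(?P<source>[\w.]+\.py):(?P<line>\d+)\]\[ERROR\]")


def error_sources(log_text: str) -> List[str]:
    """Return "file.py:line" for each ERROR record found in log_text."""
    return [f"{m['source']}:{m['line']}" for m in ERROR_RECORD.finditer(log_text)]


sample = (
    "[2022-09-12 17:48:11,542 N000000:262][executor.py:127][ERROR] "
    "Executor failed in gemm-flops.\n"
    "[2022-09-12 17:48:11,943 N000000:112306][ansible.py:78][INFO] "
    "Run succeed, return code 0.\n"
)
print(error_sources(sample))
```

A wrapper can exit with a non-zero code whenever `error_sources` returns a non-empty list, even though sb itself exited with 0.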