
sb should return a non-zero exit code when executor.py fails

Open LiweiPeng opened this issue 3 years ago • 1 comments

What's the issue, what's expected?: This is with the v0.6.0 release. The gemm-flops benchmark was run on a platform where the GPU is probably not supported (Tesla K80). SuperBench logs an internal error such as "Executor failed in gemm-flops, invalid context.", yet at the end the sb command still returns exit code 0.

Expected: sb should return a non-zero exit code for this type of error.
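To make the expectation concrete, here is a minimal sketch of the behavior I have in mind. It is not SuperBench's actual code: `run_all` and the callables passed to it are hypothetical stand-ins for whatever runs each benchmark. The idea is only that the driver remembers the first non-zero per-benchmark return code and hands it back as the process exit code.

```python
import sys
from typing import Callable, Dict


def run_all(benchmarks: Dict[str, Callable[[], int]]) -> int:
    """Run every benchmark; return 0 only if all of them returned 0.

    `benchmarks` maps a benchmark name to a callable that returns its
    return code. Both are hypothetical stand-ins, not SuperBench APIs.
    """
    exit_code = 0
    for name, run in benchmarks.items():
        try:
            rc = run()
        except Exception as exc:
            # An unexpected exception counts as a generic failure.
            print(f"Executor failed in {name}: {exc}", file=sys.stderr)
            rc = 1
        if rc != 0:
            print(f"benchmark: {name}, return code: {rc}", file=sys.stderr)
            # Keep running the remaining benchmarks, but remember the
            # first failure so it can become the process exit code.
            exit_code = exit_code or rc
    return exit_code
```

A command-line entry point would then finish with something like `sys.exit(run_all(...))` rather than always exiting with 0.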

How to reproduce it?: On a VM with a Tesla K80 GPU (or CPU), run the gemm-flops benchmark.

Log message or snapshot?:
[2022-09-09 20:38:20,578 N000000:38919][runner.py:392][INFO] Runner is going to run gemm-flops in local mode, proc rank 1.
[2022-09-09 20:38:20,580 N000000:38919][ansible.py:107][INFO] Run docker exec --env-file /tmp/sb.env sb-workspace bash -c 'PROC_RANK=1 CUDA_VISIBLE_DEVICES=1 timeout 1200 sb exec --output-dir outputs/2022-09-09_20-38-14 -c sb.config.yaml -C superbench.enable=gemm-flops' on remote ...
[2022-09-09 20:38:20,580 N000000:38919][ansible.py:72][INFO] Run as sudo ...

localhost | CHANGED | rc=0 >>
[2022-09-09 20:38:22,577 N000000:246][executor.py:235][INFO] Executor is going to execute gemm-flops.
[2022-09-09 20:38:23,363 N000000:246][registry.py:255][WARNING] Benchmark has no implementation, name: gemm-flops, platform: CPU
[2022-09-09 20:38:23,364 N000000:246][executor.py:132][ERROR] Executor failed in gemm-flops, invalid context.

localhost | CHANGED | rc=0 >>
[2022-09-09 20:38:22,702 N000000:260][executor.py:235][INFO] Executor is going to execute gemm-flops.
[2022-09-09 20:38:23,479 N000000:260][registry.py:255][WARNING] Benchmark has no implementation, name: gemm-flops, platform: CPU
[2022-09-09 20:38:23,479 N000000:260][executor.py:132][ERROR] Executor failed in gemm-flops, invalid context.
[2022-09-09 20:38:23,731 N000000:38918][ansible.py:78][INFO] Run succeed, return code 0.
[2022-09-09 20:38:23,860 N000000:38919][ansible.py:78][INFO] Run succeed, return code 0.
[2022-09-09 20:38:23,862 N000000:38433][ansible.py:125][INFO] Run playbook fetch_results.yaml ...

PLAY [Fetch Results] ***********************************************************

TASK [Gathering Facts] *********************************************************
ok: [localhost]

TASK [Synchronize Output Directory] ********************************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost : ok=2 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
[2022-09-09 20:38:26,514 N000000:38433][ansible.py:78][INFO] Run succeed, return code 0.
[2022-09-09 20:38:26,516 N000000:38433][runner.py:256][ERROR] Invalid content in JSON file: /home/aiscadmin/superbench/outputs/2022-09-09_20-38-14/nodes/N000000/benchmarks/gemm-flops/rank0/results.json
[2022-09-09 20:38:26,516 N000000:38433][runner.py:256][ERROR] Invalid content in JSON file: /home/aiscadmin/superbench/outputs/2022-09-09_20-38-14/nodes/N000000/benchmarks/gemm-flops/rank1/results.json
2022-09-09 20:38:26.746808: Command exit code: 0
Finished all. errors=0, runtime=13.1 s

Additional information:

LiweiPeng avatar Sep 09 '22 20:09 LiweiPeng

For the Tesla K80 GPU, after I manually created nvidia-uvm, the NVIDIA GPU is detected and used. executor.py now fails because the GPU type is not supported (compute capability 3.7). Again, the 'sb' script exits with code 0, although it is expected to fail.

[2022-09-12 17:48:10,690 N000000:262][executor.py:235][INFO] Executor is going to execute gemm-flops.
[2022-09-12 17:48:11,541 N000000:262][cuda_gemm_flops_performance.py:75][ERROR] Unsupported architecture - benchmark: gemm-flops, compute capability: 3.7, supports 7.0 7.5 8.0 8.6
[2022-09-12 17:48:11,542 N000000:262][executor.py:120][INFO] benchmark: gemm-flops, return code: 34, result: {'return_code': [34]}.
[2022-09-12 17:48:11,542 N000000:262][executor.py:127][ERROR] Executor failed in gemm-flops.
[2022-09-12 17:48:11,943 N000000:112306][ansible.py:78][INFO] Run succeed, return code 0.
[2022-09-12 17:48:11,988 N000000:112307][ansible.py:78][INFO] Run succeed, return code 0.
[2022-09-12 17:48:11,990 N000000:111757][ansible.py:125][INFO] Run playbook fetch_results.yaml ...

PLAY [Fetch Results] ***********************************************************

TASK [Gathering Facts] *********************************************************
by setting deprecation_warnings=False in ansible.cfg.
ok: [localhost]

TASK [Synchronize Output Directory] ********************************************
changed: [localhost]

PLAY RECAP *********************************************************************
localhost : ok=2 changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
[2022-09-12 17:48:14,851 N000000:111757][ansible.py:78][INFO] Run succeed, return code 0.
2022-09-12 17:48:15.170990: Command exit code: 0
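Until the exit code is fixed, a possible workaround is to derive a failure status from the captured log. The sketch below is only that, a workaround under assumptions: it relies on the `[file.py:line][ERROR]` record format seen in the logs above, and `count_error_records` and `effective_exit_code` are hypothetical helper names, not part of SuperBench.

```python
import re

# Matches the "[executor.py:127][ERROR]" part of a log record, based on
# the format seen in the logs above (an assumption, not a documented API).
ERROR_RECORD = re.compile(r"\[[^\[\]]+\.py:\d+\]\[ERROR\]")


def count_error_records(log_text: str) -> int:
    """Count log records whose level field is ERROR."""
    return len(ERROR_RECORD.findall(log_text))


def effective_exit_code(sb_exit_code: int, log_text: str) -> int:
    """Return a non-zero code if sb failed or the log has ERROR records."""
    if sb_exit_code != 0:
        # sb already reported a failure; pass it through unchanged.
        return sb_exit_code
    return 1 if count_error_records(log_text) > 0 else 0
```

A wrapper script could capture the sb output, call `effective_exit_code` with the real exit code and the captured text, and exit with the result. Scanning logs is fragile, so this only papers over the problem; the proper fix is still for sb itself to return non-zero.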

LiweiPeng avatar Sep 12 '22 17:09 LiweiPeng