No such file or directory error

Open Ruturaj4 opened this issue 1 year ago • 2 comments

I am using the latest Omniperf to run some XLA tests.

GPU: MI300

omniperf profile -n scatt -- /grok/grok-1-rocm/xla/bazel-bin/xla/service/gpu/tests/select_and_scatter_test --gtest_filter=SelectAndScatterTest.SelectAndScatterPerformance
  ___                  _                  __
 / _ \ _ __ ___  _ __ (_)_ __   ___ _ __ / _|
| | | | '_ ` _ \| '_ \| | '_ \ / _ \ '__| |_
| |_| | | | | | | | | | | |_) |  __/ |  |  _|
 \___/|_| |_| |_|_| |_|_| .__/ \___|_|  |_|
                        |_|

Omniperf version: 2.0.0-RC1
Profiler choice: rocprofv2
Path: /grok/grok-1-rocm/xla/workloads/scatt/MI300X_A1
Target: MI300X_A1
Command: /grok/grok-1-rocm/xla/bazel-bin/xla/service/gpu/tests/select_and_scatter_test --gtest_filter=SelectAndScatterTest.SelectAndScatterPerformance
Kernel Selection: None
Dispatch Selection: None
IP Blocks: All

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Collecting Performance Counters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[profiling] Current input file: /grok/grok-1-rocm/xla/workloads/scatt/MI300X_A1/perfmon/SQ_IFETCH_LEVEL.txt
   |-> [/opt/rocm-6.2.0-13796/bin/rocprofv2] /bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US)
   |-> [/opt/rocm-6.2.0-13796/bin/rocprofv2] /opt/rocm-6.2.0-13796/bin/rocprofv2: line 301: /grok/grok-1-rocm/xla/bazel-bin/xla/service/gpu/tests/select_and_scatter_test --gtest_filter=SelectAndScatterTest.SelectAndScatterPerformance: No such file or directory
   |-> [/opt/rocm-6.2.0-13796/bin/rocprofv2]
ERROR Profiling execution failed.

However, the command works without omniperf!

Ruturaj4 avatar Apr 18 '24 01:04 Ruturaj4

@Ruturaj4 It looks like there was an issue setting the locale on this system to UTF-8, particularly in this function call: https://github.com/ROCm/omniperf/blob/0c8591ccca179e2f22cd4e402197434619be40f5/src/utils/utils.py#L607-L615

Could you try running locale.setlocale(locale.LC_ALL, "en_US.UTF-8") manually in a Python session to see whether this is where the error comes from?
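A minimal sketch of that manual check, assuming it is run with the same Python interpreter Omniperf uses. The helper name try_set_locale is hypothetical and not part of Omniperf; only locale.setlocale and locale.Error come from the standard library.

```python
import locale


def try_set_locale(name: str = "en_US.UTF-8") -> bool:
    """Return True if the process locale can be switched to `name`.

    Hypothetical helper for illustration; it is not Omniperf code.
    """
    try:
        locale.setlocale(locale.LC_ALL, name)
    except locale.Error:
        # Raised when the requested locale is not installed on the system.
        return False
    return True


if __name__ == "__main__":
    print("en_US.UTF-8 available:", try_set_locale())
```

If this prints False, the locale is probably not generated inside the container, which would match the setlocale warning in the log but would not by itself explain the "No such file or directory" failure.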

coleramos425 avatar Apr 18 '24 20:04 coleramos425

Yeah, I tried that already. It looks like rocprofv2 has the same locale issue (I get the same warning when running rocprofv2 on its own), yet rocprofv2 itself works just fine.

Ruturaj4 avatar Apr 19 '24 16:04 Ruturaj4

Hi @coleramos425, I have a similar problem on both MI200 and MI300. I can't call omniperf profile this way: omniperf profile -n vcopy -- ./vcopy -n 1048576 -b 256. It gives me this error:

   INFO Omniperf version: 2.0.0
   INFO Profiler choice: rocprofv2
   INFO Path: /root/workspace_raid/omniperf/sample/workloads/vcopy/MI200
   INFO Target: MI200
   INFO Command: ./vcopy -n 1048576 -b 256
   INFO Kernel Selection: None
   INFO Dispatch Selection: None
   INFO Hardware Blocks: All
   INFO 
   INFO ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   INFO Collecting Performance Counters
   INFO ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   INFO 
   INFO [profiling] Current input file: /root/workspace_raid/omniperf/sample/workloads/vcopy/MI200/perfmon/SQ_IFETCH_LEVEL.txt
   INFO    |-> [rocprofv2] /usr/bin/rocprofv2: line 301: ./vcopy -n 1048576 -b 256: No such file or directory
   INFO    |-> [rocprofv2] 
  ERROR Profiling execution failed.

However, if I put ./vcopy -n 1048576 -b 256 into a shell script and call profile with omniperf profile -n vcopy -- ./test.sh, it works.

My locale settings look like this:

LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

BTW, I could use omniperf without problems yesterday, but it failed after a Docker container restart, so I think it's most likely an environment problem. It's hard to figure out what causes it, though.

aska-0096 avatar May 21 '24 06:05 aska-0096

I can work around this issue with:

export ROCPROF=rocprof

But rocprofv2 is the recommended profiler, right?
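For anyone scripting this workaround, a rough sketch of the same thing from Python. The ROCPROF variable and the omniperf profile command line are taken from this thread; the helper build_profile_invocation is hypothetical and only assembles the arguments and environment without launching anything.

```python
import os
import shlex


def build_profile_invocation(name, workload_argv, profiler="rocprof"):
    """Return (argv, env) for an `omniperf profile` run with ROCPROF set.

    Hypothetical helper: it only builds the pieces, it does not run them.
    """
    env = dict(os.environ)
    env["ROCPROF"] = profiler  # same effect as `export ROCPROF=rocprof`
    argv = ["omniperf", "profile", "-n", name, "--", *workload_argv]
    return argv, env


argv, env = build_profile_invocation(
    "vcopy", ["./vcopy", "-n", "1048576", "-b", "256"]
)
print("ROCPROF=" + env["ROCPROF"], shlex.join(argv))
# To actually launch it: subprocess.run(argv, env=env, check=True)
```

This only switches the profiler for the one invocation, so the default choice stays untouched for other runs.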

aska-0096 avatar May 21 '24 06:05 aska-0096

@aska-0096 If we assume the vcopy executable is being compiled properly, and you can confirm that with a quick sanity check (i.e. ./vcopy -n 1048576 -b 256), I would guess that the Docker container is being reloaded incorrectly.

One common issue is that the container isn't being loaded with the proper permissions/groups. For reference I usually use:

$ docker run -it --network=host --device=/dev/kfd --device=/dev/dri/renderD128 --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined <image-id>

Give this a try and if you still face issues we can schedule a meeting to debug. Thanks.

coleramos425 avatar May 21 '24 13:05 coleramos425

I came across this issue too. It works after putting the command to be executed into a shell script. Could you help fix this bug?
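A small sketch of the shell-script workaround, for reference. The contents of the wrapper are an assumption (the thread does not show what test.sh contains), and write_wrapper_script is a hypothetical helper name.

```python
import os
import shlex
import stat
import tempfile


def write_wrapper_script(workload_argv, path):
    """Write an executable bash script that runs the workload command.

    Hypothetical helper; the script body is an assumed equivalent of the
    test.sh mentioned in this thread.
    """
    body = "#!/bin/bash\n" + shlex.join(workload_argv) + "\n"
    with open(path, "w") as handle:
        handle.write(body)
    # Add the execute bits so the profiler can launch the script directly.
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path


demo_dir = tempfile.mkdtemp()
script = write_wrapper_script(
    ["./vcopy", "-n", "1048576", "-b", "256"],
    os.path.join(demo_dir, "test.sh"),
)
with open(script) as handle:
    print(handle.read(), end="")
```

The profile command then takes the script as a single argument, as in omniperf profile -n vcopy -- ./test.sh, so there are no workload arguments left for the profiler wrapper to mis-parse.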

bangtianliu avatar Jun 27 '24 21:06 bangtianliu

This issue is related to a rocprofv2 change: specifically, rocprofv2 now uses exec to handle argument parsing in the latest versions of ROCm. A slight logic change was required on Omniperf's end to account for this. We've pushed a patch to our dev branch, and the fix will be available in our next release.
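To illustrate the general failure mode (a sketch only, not the actual rocprofv2 or Omniperf code): when a whole command line reaches an exec-style call as one token, the OS looks for a file whose name is the entire string, arguments included. That is consistent with the "./vcopy -n 1048576 -b 256: No such file or directory" message in the logs above. The function names below are hypothetical, and the Python interpreter stands in for the workload.

```python
import shlex
import subprocess
import sys


def run_as_single_token(command: str) -> str:
    """Hand the whole command line to exec as ONE argv element."""
    try:
        subprocess.run([command], check=True, capture_output=True)
    except FileNotFoundError:
        # No file is literally named "<exe> -c ...", so exec fails.
        return "No such file or directory"
    return "ok"


def run_as_split_tokens(command: str) -> str:
    """Split the command line into tokens first, then exec it."""
    tokens = shlex.split(command)
    result = subprocess.run(tokens, check=True, capture_output=True, text=True)
    return result.stdout.strip()


# Stand-in workload: the current Python interpreter printing one word.
command = shlex.join([sys.executable, "-c", "print('hello')"])

print(run_as_single_token(command))
print(run_as_split_tokens(command))
```

It also suggests why the shell-script workaround helps: a script path is a single token with no arguments attached, so it survives either way.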

coleramos425 avatar Jul 03 '24 16:07 coleramos425