Scalene error: received signal SIGSEGV when using PyTorch on ROCm

Open Bengt opened this issue 2 years ago • 1 comment

Describe the bug

When I run my PyTorch training code on ROCm with an AMD GPU, I get an ominous error:

$ scalene training.py
Scalene error: received signal SIGSEGV 

When I run the same code with only CPU profiling, the error disappears:

$ scalene --cpu-only training.py

To Reproduce

Since my training code is rather large, I cannot with reasonable effort provide a minimal working example. However, note that simple PyTorch code actually works fine:

from torch import Tensor
from torch import rand


def pytorch_iterating_random_tensor():
    # Arrange
    dimension_0: int = 3
    dimension_1: int = 2

    # Act
    tensor: Tensor = rand(
        dimension_0,
        dimension_1,
    )

    # Assert
    assert isinstance(tensor, Tensor)
    for dimension_0_index in range(dimension_0):
        for dimension_1_index in range(dimension_1):
            assert 0 <= tensor[dimension_0_index][dimension_1_index] <= 1


if __name__ == '__main__':
    pytorch_iterating_random_tensor()
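
The example above creates its tensor on the CPU, which is the default device for `rand`, so it may never initialize the ROCm runtime at all. A slightly larger reproduction could place the tensor on the GPU instead. The following is a minimal sketch, assuming a ROCm build of PyTorch is installed. ROCm builds expose AMD GPUs through the `torch.cuda` API, so the device string is `"cuda"`. The function name `gpu_smoke_test` is hypothetical and not part of the original report.

```python
import importlib.util


def gpu_smoke_test() -> str:
    """Run one small GPU computation and report what happened."""
    if importlib.util.find_spec("torch") is None:
        return "skipped: torch is not installed"

    import torch

    # ROCm builds of PyTorch report AMD GPUs through the torch.cuda API.
    if not torch.cuda.is_available():
        return "skipped: no GPU visible to PyTorch"

    # Allocate directly on the GPU so the GPU runtime is initialized.
    tensor = torch.rand(3, 2, device="cuda")
    # The matrix product launches a kernel; .item() copies the result back.
    total = (tensor @ tensor.T).sum().item()
    return f"ok: sum={total:.4f}"


if __name__ == "__main__":
    print(gpu_smoke_test())
```

If this script, saved under a name of your choice, crashes under `scalene` but not under `scalene --cpu-only`, the problem is probably tied to GPU initialization rather than to anything specific to the larger training code.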

Expected behavior

I expected Scalene to profile a more complex PyTorch application just as it does the trivial one.

Desktop (please complete the following information):

  • OS: Ubuntu 22.04 + ROCm 5.2
  • Version: 1.5.14; the current repository version was tested as well

Additional context

I see some of my print output before the segfault, so it seems likely that the initialization of ROCm/OpenML triggers the issue in Scalene.
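
One way to narrow down which Python line is executing when the crash occurs is the standard library's `faulthandler` module, which writes the Python traceback of every thread when the process receives SIGSEGV. The following is a minimal sketch, assuming it is called at the very top of `training.py`. The helper name `enable_crash_tracebacks` and the log file name are hypothetical, and whether this coexists cleanly with Scalene's own signal handling has not been verified here.

```python
import faulthandler

# Kept at module level so the file stays open for the life of the process;
# faulthandler writes to the underlying file descriptor when a crash occurs.
_crash_log = None


def enable_crash_tracebacks(path: str = "crash_traceback.log") -> bool:
    """Dump Python tracebacks to `path` on fatal signals such as SIGSEGV."""
    global _crash_log
    _crash_log = open(path, "w")
    # all_threads=True includes every Python thread in the dump.
    faulthandler.enable(file=_crash_log, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    enable_crash_tracebacks()
    # ... the rest of training.py would follow here ...
```

Writing to a dedicated file rather than to stderr keeps the traceback separate from the profiler's own output.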

Bengt, Nov 08 '22 20:11

According to the README, I believe only NVIDIA GPUs are supported for profiling.

vmkalbskopf, Jun 08 '23 18:06