
Failing tests

GAN-pie opened this issue 3 years ago • 11 comments

Hi, I have freshly installed speechbrain, upgrading from 0.5.8 to the latest release, and I spotted 4 tests that are failing with weird assertions. I'm not 100% sure, but I do not think these tests were failing on the older install.

Here is the test output:

FAILED tests/unittests/test_features.py::test_features_multimic[cuda] - AssertionError: assert tensor(0.0008, device='cuda:0') < 5e-05
FAILED tests/unittests/test_profiling.py::test_profile_class[cuda] - assert 3 == 2
FAILED tests/unittests/test_profiling.py::test_scheduler[cuda] - assert 1 == 0
FAILED tests/unittests/test_profiling.py::test_aggregated_traces[cuda] - assert (4222 == 4209 or 4209 == 2993)

GAN-pie avatar Jun 23 '22 14:06 GAN-pie

@anautsch this looks like something is not working as intended in the profiling tests?

TParcollet avatar Jun 23 '22 16:06 TParcollet

Hi @GAN-pie, thanks for reaching out! I can assist with the profiling tests. We provide a wrapper to benchmark VRAM peaks and real-time demands. The reported errors come from a mismatching number of nested function calls. We have had similar issues with test_profiling over the past weeks, as the number of nested function calls keeps changing with operating systems, library versions, etc. - sorry for that. We use pytorch's profiling tools and wrap them. I've tested on cpu only - I see you run the tests on cuda.
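
For context, a minimal sketch of the underlying pytorch profiler that our wrapper builds on (generic illustration only, not the actual speechbrain wrapper; the toy model and shapes are made up):

import torch
from torch import profiler

model = torch.nn.Linear(10, 10)  # toy model standing in for a speechbrain module

with profiler.profile(
    activities=[profiler.ProfilerActivity.CPU],  # add ProfilerActivity.CUDA on a gpu machine
    profile_memory=True,  # also track tensor memory allocations
) as prof:
    model(torch.rand(32, 10))

# per-operator table with time and memory columns
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))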

Which pytorch version are you running? (curiosity question)

@TParcollet these issues with the profiling tests are starting to be more of an annoyance than a help in checking the profiling tool - mainly because everything underneath keeps changing like a moving target. In essence, that makes harmonised profiling & benchmarking questionable in itself, since the results depend on everything that is circumstantial to the module actually under test: with profiling, it is the whole code, hardware, ... that is under test. Comparability across sites gathering such data can only be achieved on categorical/nominal scales.

Will update by relaxing the profiling test assertions -

anautsch avatar Jun 27 '22 11:06 anautsch

@GAN-pie please pull the latest develop and re-run.

On FAILED tests/unittests/test_features.py::test_features_multimic[cuda] someone else might be better suited to help you.

anautsch avatar Jun 27 '22 12:06 anautsch

@anautsch 2 out of 3 test_profiling.py tests seem to work properly now, but one still fails on the assert 1 == 0 statement.

FAILED tests/unittests/test_profiling.py::test_scheduler[cuda] - assert 1 == 0

E       assert 1 == 0
E        +  where 1 = len([<FunctionEvent id=0 name=cudaDeviceSynchronize device_type=DeviceType.CPU node_id=-1 cpu_time=8.000us start_us=805 en...ead=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>])
E        +    where [<FunctionEvent id=0 name=cudaDeviceSynchronize device_type=DeviceType.CPU node_id=-1 cpu_time=8.000us start_us=805 en...ead=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>] = <bound method _KinetoProfile.events of <torch.profiler.profiler.profile object at 0x7f6e5d0475e0>>()
E        +      where <bound method _KinetoProfile.events of <torch.profiler.profiler.profile object at 0x7f6e5d0475e0>> = <torch.profiler.profiler.profile object at 0x7f6e5d0475e0>.events
E        +        where <torch.profiler.profiler.profile object at 0x7f6e5d0475e0> = <test_profiling.test_scheduler.<locals>.SimpleBrainUntracked object at 0x7f6e5d0dd820>.profiler

tests/unittests/test_profiling.py:351: AssertionError

And for your curiosity, I'm running torch 1.11

GAN-pie avatar Jun 27 '22 15:06 GAN-pie

@GAN-pie this is odd.

Right now, I can only explain what should happen there - obviously, it is not working for you. https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/tests/unittests/test_profiling.py#L351 is part of testing pytorch's profiling scheduler, which postpones recording of time/memory information until a few iterations have happened (by then, the algorithm execution is warmed up, whatever should be cached is cached for efficient execution, and one benchmarks the part of the algorithm that runs most of the time).
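
To make the scheduling behaviour concrete, here is a generic pytorch example (not the speechbrain test itself; the toy model and step count are made up) - nothing is recorded until the wait and warm-up steps have passed:

import torch
from torch import profiler

model = torch.nn.Linear(10, 10)  # toy model

with profiler.profile(
    activities=[profiler.ProfilerActivity.CPU],
    schedule=profiler.schedule(wait=2, warmup=2, active=2, repeat=1),
) as prof:
    for _ in range(6):
        model(torch.rand(1, 10))
        prof.step()  # our wrapper advances this after each batch

# only the two 'active' steps (steps 4 and 5 here) end up in the recorded events
print(prof.key_averages().table(sort_by="cpu_time_total"))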

There are two settings tested before this one, both of which work.

a) With a scheduler for a Brain class, the recording needs one warm-up step and only records afterwards:
https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/tests/unittests/test_profiling.py#L205
https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/tests/unittests/test_profiling.py#L214

b) With a scheduler for a function that runs brain.fit(), but without this outer scheduler being triggered (no events are recorded, and we can successfully check for the error that results when we look for these events):
https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/tests/unittests/test_profiling.py#L318

About the setting which is failing on your set-up (it had been working on our end with pytorch 1.11).

The scheduler is added after the class is initialised:

  class SimpleBrainUntracked(Brain):
    ...
  brain_or_pretrained = SimpleBrainUntracked({"model": model}, lambda x: SGD(x, 0.1), run_opts={"device": device})
  scheduled_profiler = schedule(profile)
  scheduled_profiler(brain_or_pretrained)

Then, the evaluate function of core.Brain is tested - it increments the profiler step after each batch: https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/speechbrain/core.py#L1251

There, we check for different batch sizes (simulated by mock-up test sets of different shapes):
https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/tests/unittests/test_profiling.py#L340
https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/tests/unittests/test_profiling.py#L350
https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/tests/unittests/test_profiling.py#L361
https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/tests/unittests/test_profiling.py#L373
... and it is only for the last one that the profiler should be recording - before that, zero events should be recorded:
https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/tests/unittests/test_profiling.py#L376

I documented debug information of the mapping from profiler steps to scheduled profiler actions: https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/speechbrain/utils/profiling.py#L75

Can you please run these lines, and post what you get?

from torch import profiler
torch_scheduler = profiler.schedule(
    wait=2,
    warmup=2,
    active=2,
    repeat=1,
    skip_first=0,
)
_=[print("step {}: {}".format(i, torch_scheduler(x))) for i, x in enumerate(range(10))]

On my laptop with pytorch==1.11.0+cu102:

step 0: ProfilerAction.NONE
step 1: ProfilerAction.NONE
step 2: ProfilerAction.WARMUP
step 3: ProfilerAction.WARMUP
step 4: ProfilerAction.RECORD
step 5: ProfilerAction.RECORD_AND_SAVE
step 6: ProfilerAction.NONE
step 7: ProfilerAction.NONE
step 8: ProfilerAction.NONE
step 9: ProfilerAction.NONE

anautsch avatar Jun 28 '22 08:06 anautsch

@anautsch the outputs are exactly the same on my computer (pytorch=1.11, cuda=11.3.1).

Also, I tried running pytest without the cuda flag and it returns the same assertion error.

GAN-pie avatar Jun 28 '22 09:06 GAN-pie

@GAN-pie ok, at least that - please let us try one more round of remote debugging.

import torch
from pytest import raises
from torch.optim import SGD
from speechbrain.core import Brain
from speechbrain.utils.profiling import profile, schedule

device = 'cpu'  # another run with 'cuda:0'

model = torch.nn.Linear(in_features=10, out_features=10, device=device)
inputs = torch.rand(10, 10, device=device)
targets = torch.rand(10, 10, device=device)

class SimpleBrainUntracked(Brain):
    def compute_forward(self, batch, stage):
        return self.modules.model(batch[0])
    def compute_objectives(self, predictions, batch, stage):
        return torch.nn.functional.l1_loss(predictions, batch[1])

brain_or_pretrained = SimpleBrainUntracked({"model": model}, lambda x: SGD(x, 0.1), run_opts={"device": device})
scheduled_profiler = schedule(profile)
scheduled_profiler(brain_or_pretrained)
brain_or_pretrained.evaluate(
    test_set=(
        [inputs, targets],  # +1x test_set -> step_num=2
        [inputs, targets],
    )
)

print(brain_or_pretrained.profiler.events())

It should give an empty line. You can also try:

brain_or_pretrained.profiler.events()
[]
brain_or_pretrained.profiler.profiler
[]

On the contrary, with:

brain_or_pretrained.evaluate(
    test_set=(
        [inputs, targets],  # +1x test_set -> step_num=2
        [inputs, targets],
        [inputs, targets],
        [inputs, targets],
    )
)

print(brain_or_pretrained.profiler.events())

This results in a benchmark table, and the other two lines give back a list of FunctionEvents. I'm curious how it looks on your side. I'll have to test it later with our setting - I want to be sure I reproduce it correctly...

anautsch avatar Jun 28 '22 10:06 anautsch

On my side it behaves differently.

>>> print(brain_or_pretrained.profiler.events()) does not give an empty line, but the following:

---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
         cudaGetDeviceCount         0.03%     500.000us         0.03%     500.000us     500.000us             1  
         cudaGetDeviceCount         0.00%       0.000us         0.00%       0.000us       0.000us             1  
    cudaGetDeviceProperties         0.01%     116.000us         0.01%     116.000us     116.000us             1  
    cudaGetDeviceProperties         0.01%     105.000us         0.01%     105.000us     105.000us             1  
      cudaDeviceSynchronize        99.96%        1.921s        99.96%        1.921s        1.921s             1  
---------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.921s

and brain_or_pretrained.profiler.profiler returns:

[<FunctionEvent id=0 name=cudaGetDeviceCount device_type=DeviceType.CPU node_id=-1 cpu_time=500.000us start_us=32 end_us=532 cpu_children=[] cuda_time=0.000us name=cudaGetDeviceCount thread=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=0 name=cudaGetDeviceCount device_type=DeviceType.CPU node_id=-1 cpu_time=0.000us start_us=547 end_us=547 cpu_children=[] cuda_time=0.000us name=cudaGetDeviceCount thread=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=0 name=cudaGetDeviceProperties device_type=DeviceType.CPU node_id=-1 cpu_time=116.000us start_us=551 end_us=667 cpu_children=[] cuda_time=0.000us name=cudaGetDeviceProperties thread=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=0 name=cudaGetDeviceProperties device_type=DeviceType.CPU node_id=-1 cpu_time=105.000us start_us=726 end_us=831 cpu_children=[] cuda_time=0.000us name=cudaGetDeviceProperties thread=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=0 name=cudaDeviceSynchronize device_type=DeviceType.CPU node_id=-1 cpu_time=1.921s start_us=876 end_us=1921557 cpu_children=[] cuda_time=0.000us name=cudaDeviceSynchronize thread=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>]

However, when using:

brain_or_pretrained.evaluate(
    test_set=(
        [inputs, targets],  # +1x test_set -> step_num=2
        [inputs, targets],
        [inputs, targets],
        [inputs, targets],
    )
)

It returns a float: 0.682352602481842. Then print(brain_or_pretrained.profiler.events()) gives me:

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            aten::zeros         0.00%       8.000us         0.00%      10.000us      10.000us             1  
                                            aten::empty         0.00%       1.000us         0.00%       1.000us       1.000us             1  
                                            aten::zero_         0.00%       1.000us         0.00%       1.000us       1.000us             1  
                                          ProfilerStep*         0.01%     143.000us         0.01%     152.000us     152.000us             1  
                                            aten::empty         0.00%       1.000us         0.00%       1.000us       1.000us             1  
                                            aten::zeros         0.00%       1.000us         0.00%       3.000us       3.000us             1  
                                            aten::empty         0.00%       2.000us         0.00%       2.000us       2.000us             1  
                                            aten::zero_         0.00%       0.000us         0.00%       0.000us       0.000us             1  
enumerate(DataLoader)#_SingleProcessDataLoaderIter._...         0.00%       5.000us         0.00%       5.000us       5.000us             1  
                                            aten::empty         0.00%       0.000us         0.00%       0.000us       0.000us             1  
                                     cudaGetDeviceCount         0.03%     499.000us         0.03%     499.000us     499.000us             1  
                                     cudaGetDeviceCount         0.00%       0.000us         0.00%       0.000us       0.000us             1  
                                cudaGetDeviceProperties         0.01%     120.000us         0.01%     120.000us     120.000us             1  
                                cudaGetDeviceProperties         0.01%     103.000us         0.01%     103.000us     103.000us             1  
                                  cudaDeviceSynchronize        99.95%        1.905s        99.95%        1.905s        1.905s             1  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.905s

Finally brain_or_pretrained.profiler.profiler outputs the following list:

[<FunctionEvent id=1 name=aten::zeros device_type=DeviceType.CPU node_id=-1 cpu_time=10.000us start_us=25 end_us=35 cpu_children=[2, 3] cuda_time=0.000us name=aten::zeros thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=2 name=aten::empty device_type=DeviceType.CPU node_id=-1 cpu_time=1.000us start_us=26 end_us=27 cpu_children=[] cuda_time=0.000us name=aten::empty thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=3 name=aten::zero_ device_type=DeviceType.CPU node_id=-1 cpu_time=1.000us start_us=34 end_us=35 cpu_children=[] cuda_time=0.000us name=aten::zero_ thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=4 name=ProfilerStep* device_type=DeviceType.CPU node_id=-1 cpu_time=152.000us start_us=40 end_us=192 cpu_children=[5, 6, 8, 9] cuda_time=0.000us name=ProfilerStep* thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=5 name=aten::empty device_type=DeviceType.CPU node_id=-1 cpu_time=1.000us start_us=41 end_us=42 cpu_children=[] cuda_time=0.000us name=aten::empty thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=6 name=aten::zeros device_type=DeviceType.CPU node_id=-1 cpu_time=3.000us start_us=46 end_us=49 cpu_children=[7] cuda_time=0.000us name=aten::zeros thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=7 name=aten::empty device_type=DeviceType.CPU node_id=-1 cpu_time=2.000us start_us=46 end_us=48 cpu_children=[] cuda_time=0.000us name=aten::empty thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=8 name=aten::zero_ device_type=DeviceType.CPU node_id=-1 cpu_time=0.000us start_us=49 end_us=49 cpu_children=[] cuda_time=0.000us name=aten::zero_ thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=9 name=enumerate(DataLoader)#_SingleProcessDataLoaderIter.__next__ device_type=DeviceType.CPU node_id=-1 cpu_time=5.000us start_us=51 end_us=56 cpu_children=[10] cuda_time=0.000us name=enumerate(DataLoader)#_SingleProcessDataLoaderIter.__next__ thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=10 name=aten::empty device_type=DeviceType.CPU node_id=-1 cpu_time=0.000us start_us=52 end_us=52 cpu_children=[] cuda_time=0.000us name=aten::empty thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=0 name=cudaGetDeviceCount device_type=DeviceType.CPU node_id=-1 cpu_time=499.000us start_us=226 end_us=725 cpu_children=[] cuda_time=0.000us name=cudaGetDeviceCount thread=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=0 name=cudaGetDeviceCount device_type=DeviceType.CPU node_id=-1 cpu_time=0.000us start_us=737 end_us=737 cpu_children=[] cuda_time=0.000us name=cudaGetDeviceCount thread=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 
is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=0 name=cudaGetDeviceProperties device_type=DeviceType.CPU node_id=-1 cpu_time=120.000us start_us=740 end_us=860 cpu_children=[] cuda_time=0.000us name=cudaGetDeviceProperties thread=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=0 name=cudaGetDeviceProperties device_type=DeviceType.CPU node_id=-1 cpu_time=103.000us start_us=931 end_us=1034 cpu_children=[] cuda_time=0.000us name=cudaGetDeviceProperties thread=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=0 name=cudaDeviceSynchronize device_type=DeviceType.CPU node_id=-1 cpu_time=1.905s start_us=1081 end_us=1905643 cpu_children=[] cuda_time=0.000us name=cudaDeviceSynchronize thread=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>]

Hope it will help.

GAN-pie avatar Jun 28 '22 10:06 GAN-pie

Yes, we need to filter out the cuda events...

cudaGetDeviceCount cudaGetDeviceProperties cudaDeviceSynchronize

are not essential to the profiling for realtime & peak memory, although they take some time in this minimal example.

Have to think about what's better here - restrict the unit test to cpu-only, or filter out the preliminary cuda checks (only for the test - or in general). Wondering whether this is a pytorch/cuda bug or a feature of their profiling scheduler ;-)

The cudaDeviceSynchronize could also hint at an implementation issue (how we wrap the pytorch scheduler & profiler) - maybe the profiler & model are on different devices in this scenario - the first setting (a) does virtually the same test, but without the issue. There, the @scheduler & @profiler decorators are used instead.
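
One quick thing that could be checked for this hypothesis, re-using the attribute names from the script above (just a sketch):

# which activities was the profiler created with, and where does the model actually live?
print(brain_or_pretrained.profiler.activities)
print(next(brain_or_pretrained.modules.model.parameters()).device)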

Thanks @GAN-pie !

anautsch avatar Jun 28 '22 11:06 anautsch

Exactly the same situation here.

heunggyulee avatar Jun 30 '22 07:06 heunggyulee

@heunggyulee understood.

To not tamper with the profiler, I'll adjust the test - teaser:

# brain_or_pretrained.profiler.profiler will be set (not None anymore)
# when run on cpu, there are no events - but cuda activities are recorded if existing
# see: https://github.com/speechbrain/speechbrain/issues/1469
if torch.profiler.ProfilerActivity.CUDA in brain_or_pretrained.profiler.activities:
    assert (
        len(
            set([x.name for x in brain_or_pretrained.profiler.profiler.function_events])
            - {'cudaGetDeviceCount', 'cudaGetDeviceProperties', 'cudaDeviceSynchronize'}
        )
    ) == 0
else:
    assert len(brain_or_pretrained.profiler.events()) == 0

lmk


About the other one

FAILED tests/unittests/test_features.py::test_features_multimic[cuda] - AssertionError: assert tensor(0.0008, device='cuda:0') < 5e-05

the unit test looks like this

import torch

from speechbrain.processing.features import Filterbank

device = "cpu"  # device fixture in the test; the failing run uses "cuda"

compute_fbanks = Filterbank().to(device)
inputs = torch.rand([10, 101, 201], device=device)
output = compute_fbanks(inputs)
inputs_ch2 = torch.stack((inputs, inputs), -1)
output_ch2 = compute_fbanks(inputs_ch2)
output_ch2 = output_ch2[..., 0]
assert torch.sum(output - output_ch2) < 5e-05

What do you get for output and output_ch2 ?

Laptop cpu:

output
tensor([[[ -2.4998,  -3.4625,  -2.1935,  ...,   7.6033,   7.8886,   7.4454],
         [ -1.4686,  -5.4756,  -2.2520,  ...,   7.5517,   7.8166,   6.8990],
         [ -2.1951,  -8.5281,  -5.6952,  ...,   7.9044,   7.4020,   7.1556],
         ...,
         
output_ch2
tensor([[[ -2.4998,  -3.4625,  -2.1935,  ...,   7.6033,   7.8886,   7.4454],
         [ -1.4686,  -5.4756,  -2.2520,  ...,   7.5517,   7.8166,   6.8990],
         [ -2.1951,  -8.5281,  -5.6952,  ...,   7.9044,   7.4020,   7.1556],
         ...,
         
torch.sum(output - output_ch2)
tensor(4.9903e-06)
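
If it helps to narrow things down, a small diagnostic you could run on cuda right after the snippet above (names taken from the test; the tolerance is only an example):

diff = output - output_ch2
print("sum of differences:", torch.sum(diff).item())
print("max abs difference:", diff.abs().max().item())
print("allclose (atol=1e-4):", torch.allclose(output, output_ch2, atol=1e-4))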

anautsch avatar Jun 30 '22 08:06 anautsch

Closing this issue as the PR fixing this has been merged. Please feel free to re-open.

anautsch avatar Sep 16 '22 14:09 anautsch