Failing tests
Hi, I have freshly upgraded speechbrain from 0.5.8 to the latest release and spotted 4 tests failing with weird assertions. I'm not 100% sure, but I don't think these tests were failing on the older install.
Here is the test output:
FAILED tests/unittests/test_features.py::test_features_multimic[cuda] - AssertionError: assert tensor(0.0008, device='cuda:0') < 5e-05
FAILED tests/unittests/test_profiling.py::test_profile_class[cuda] - assert 3 == 2
FAILED tests/unittests/test_profiling.py::test_scheduler[cuda] - assert 1 == 0
FAILED tests/unittests/test_profiling.py::test_aggregated_traces[cuda] - assert (4222 == 4209 or 4209 == 2993)
@anautsch this looks like something is not working as intended in the profiling tests?
Hi @GAN-pie, thanks for reaching out! I can assist with the profiling tests. We provide a wrapper to benchmark VRAM peaks and real-time demands. The reported errors come from a mismatching number of nested function calls. We have had similar issues with test_profiling over the past weeks, as the number of nested functions keeps changing with operating systems, library versions, etc. - sorry for that. We use pytorch's profiling tools and wrap them. I've tested on cpu only - I see you run the tests on cuda.
Which pytorch version are you running? (curiosity question)
@TParcollet these profiling test issues are starting to become more annoying than helpful for checking the profiling tool - mainly because everything underneath keeps changing like a moving target. That in essence makes harmonised profiling & benchmarking questionable in itself, since the results depend on everything circumstantial to the actual module under test: with profiling, it is all of the code, the hardware, ... that is under test. Comparability across sites gathering such data can only be achieved on categorical/nominal scales.
Will update with reduced profiling test assertions.
@GAN-pie please pull the latest develop and re-run.
On FAILED tests/unittests/test_features.py::test_features_multimic[cuda] someone else might be better suited to help you.
@anautsch 2 out of 3 test_profiling.py tests seem to work properly now, but there is still one failing on the assert 1 == 0 statement.
FAILED tests/unittests/test_profiling.py::test_scheduler[cuda] - assert 1 == 0
E assert 1 == 0
E + where 1 = len([<FunctionEvent id=0 name=cudaDeviceSynchronize device_type=DeviceType.CPU node_id=-1 cpu_time=8.000us start_us=805 en...ead=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>])
E + where [<FunctionEvent id=0 name=cudaDeviceSynchronize device_type=DeviceType.CPU node_id=-1 cpu_time=8.000us start_us=805 en...ead=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>] = <bound method _KinetoProfile.events of <torch.profiler.profiler.profile object at 0x7f6e5d0475e0>>()
E + where <bound method _KinetoProfile.events of <torch.profiler.profiler.profile object at 0x7f6e5d0475e0>> = <torch.profiler.profiler.profile object at 0x7f6e5d0475e0>.events
E + where <torch.profiler.profiler.profile object at 0x7f6e5d0475e0> = <test_profiling.test_scheduler.<locals>.SimpleBrainUntracked object at 0x7f6e5d0dd820>.profiler
tests/unittests/test_profiling.py:351: AssertionError
And for your curiosity, I'm running torch 1.11
@GAN-pie this is odd.
Right now, I can explain only what should happen there - obviously, it is not working for you. https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/tests/unittests/test_profiling.py#L351 is part of testing pytorch's profiling scheduler, which postpones the recording of time/memory information until a few iterations have happened (by then, the algorithm execution is warmed up, whatever should be cached is cached for efficient execution, and one benchmarks the part of the algorithm that runs most of the time).
There are two settings tested before this one, which both work. a) With a scheduler for a Brain class, the recording needs one warm-up step and then records: https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/tests/unittests/test_profiling.py#L205 https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/tests/unittests/test_profiling.py#L214
b) With a scheduler for a function that runs brain.fit(), but without this outer scheduler being triggered (no events are recorded, and we can successfully check for the error that results if we look for these events): https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/tests/unittests/test_profiling.py#L318
Now, about the setting that is failing on your set-up (it had been working on our end with pytorch 1.11):
The scheduler is added after the class is initialised:
class SimpleBrainUntracked(Brain):
    ...

brain_or_pretrained = SimpleBrainUntracked(
    {"model": model}, lambda x: SGD(x, 0.1), run_opts={"device": device}
)
scheduled_profiler = schedule(profile)
scheduled_profiler(brain_or_pretrained)
Then, the core.py Brain evaluation function is tested - which increments the profiler step after each batch:
https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/speechbrain/core.py#L1251
There, we check different batch sizes (simulated by mock-up test sets of different shapes): https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/tests/unittests/test_profiling.py#L340 https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/tests/unittests/test_profiling.py#L350 https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/tests/unittests/test_profiling.py#L361 https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/tests/unittests/test_profiling.py#L373 ... and it is the last one for which the profiler should be recording - before that, there should be zero recorded events. https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/tests/unittests/test_profiling.py#L376
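In words, the pattern exercised there looks roughly like this (a minimal sketch of the per-batch step increment, with illustrative names - not the actual speechbrain.core code):
# Assumption-level sketch: evaluate() advances the wrapped torch profiler once per
# test batch, so the scheduler only reaches its RECORD phase after enough batches.
for batch in test_set:
    loss = brain.evaluate_batch(batch, stage=stage)  # illustrative call
    if brain.profiler is not None:
        brain.profiler.step()  # wait -> warmup -> record, as scheduled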
I documented debug information of the mapping from profiler steps to scheduled profiler actions: https://github.com/speechbrain/speechbrain/blob/1b5811cf1c60654723ff0f0b8b112ee832a6cdf3/speechbrain/utils/profiling.py#L75
Can you please run these lines, and post what you get?
from torch import profiler

torch_scheduler = profiler.schedule(
    wait=2,
    warmup=2,
    active=2,
    repeat=1,
    skip_first=0,
)
_ = [print("step {}: {}".format(i, torch_scheduler(x))) for i, x in enumerate(range(10))]
On my laptop with pytorch==1.11.0+cu102:
step 0: ProfilerAction.NONE
step 1: ProfilerAction.NONE
step 2: ProfilerAction.WARMUP
step 3: ProfilerAction.WARMUP
step 4: ProfilerAction.RECORD
step 5: ProfilerAction.RECORD_AND_SAVE
step 6: ProfilerAction.NONE
step 7: ProfilerAction.NONE
step 8: ProfilerAction.NONE
step 9: ProfilerAction.NONE
@anautsch the outputs are exactly the same on my computer (pytorch 1.11 + cuda 11.3.1).
Also, I tried to run pytest without the cuda flag and it returns the same assertion error.
@GAN-pie ok, at least that - please let us try one more round of remote debugging.
import torch
from pytest import raises
from torch.optim import SGD

from speechbrain.core import Brain
from speechbrain.utils.profiling import profile, schedule

device = 'cpu'  # another run with 'cuda:0'

model = torch.nn.Linear(in_features=10, out_features=10, device=device)
inputs = torch.rand(10, 10, device=device)
targets = torch.rand(10, 10, device=device)


class SimpleBrainUntracked(Brain):
    def compute_forward(self, batch, stage):
        return self.modules.model(batch[0])

    def compute_objectives(self, predictions, batch, stage):
        return torch.nn.functional.l1_loss(predictions, batch[1])


brain_or_pretrained = SimpleBrainUntracked(
    {"model": model}, lambda x: SGD(x, 0.1), run_opts={"device": device}
)
scheduled_profiler = schedule(profile)
scheduled_profiler(brain_or_pretrained)

brain_or_pretrained.evaluate(
    test_set=(
        [inputs, targets],  # +1x test_set -> step_num=2
        [inputs, targets],
    )
)
print(brain_or_pretrained.profiler.events())
It should give an empty line. You can also try:
>>> brain_or_pretrained.profiler.events()
[]
>>> brain_or_pretrained.profiler.profiler
[]
On the contrary, with:
brain_or_pretrained.evaluate(
    test_set=(
        [inputs, targets],  # +1x test_set -> step_num=2
        [inputs, targets],
        [inputs, targets],
        [inputs, targets],
    )
)
print(brain_or_pretrained.profiler.events())
This results in a benchmark table, and the other two lines give back a list of FunctionEvents. I'm curious how it looks on your side. I'll have to test it later with our setting - I want to be sure I reproduce it correctly...
On my side it behaves differently.
>>> print(brain_or_pretrained.profiler.events()) does not give an empty line, but the following:
--------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
--------------------------- ------------ ------------ ------------ ------------ ------------ ------------
cudaGetDeviceCount 0.03% 500.000us 0.03% 500.000us 500.000us 1
cudaGetDeviceCount 0.00% 0.000us 0.00% 0.000us 0.000us 1
cudaGetDeviceProperties 0.01% 116.000us 0.01% 116.000us 116.000us 1
cudaGetDeviceProperties 0.01% 105.000us 0.01% 105.000us 105.000us 1
cudaDeviceSynchronize 99.96% 1.921s 99.96% 1.921s 1.921s 1
--------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 1.921s
and brain_or_pretrained.profiler.profiler returns:
[<FunctionEvent id=0 name=cudaGetDeviceCount device_type=DeviceType.CPU node_id=-1 cpu_time=500.000us start_us=32 end_us=532 cpu_children=[] cuda_time=0.000us name=cudaGetDeviceCount thread=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=0 name=cudaGetDeviceCount device_type=DeviceType.CPU node_id=-1 cpu_time=0.000us start_us=547 end_us=547 cpu_children=[] cuda_time=0.000us name=cudaGetDeviceCount thread=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=0 name=cudaGetDeviceProperties device_type=DeviceType.CPU node_id=-1 cpu_time=116.000us start_us=551 end_us=667 cpu_children=[] cuda_time=0.000us name=cudaGetDeviceProperties thread=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=0 name=cudaGetDeviceProperties device_type=DeviceType.CPU node_id=-1 cpu_time=105.000us start_us=726 end_us=831 cpu_children=[] cuda_time=0.000us name=cudaGetDeviceProperties thread=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=0 name=cudaDeviceSynchronize device_type=DeviceType.CPU node_id=-1 cpu_time=1.921s start_us=876 end_us=1921557 cpu_children=[] cuda_time=0.000us name=cudaDeviceSynchronize thread=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>]
However, when using:
brain_or_pretrained.evaluate(
    test_set=(
        [inputs, targets],  # +1x test_set -> step_num=2
        [inputs, targets],
        [inputs, targets],
        [inputs, targets],
    )
)
It returns a float: 0.682352602481842
Then print(brain_or_pretrained.profiler.events()) gives me:
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
aten::zeros 0.00% 8.000us 0.00% 10.000us 10.000us 1
aten::empty 0.00% 1.000us 0.00% 1.000us 1.000us 1
aten::zero_ 0.00% 1.000us 0.00% 1.000us 1.000us 1
ProfilerStep* 0.01% 143.000us 0.01% 152.000us 152.000us 1
aten::empty 0.00% 1.000us 0.00% 1.000us 1.000us 1
aten::zeros 0.00% 1.000us 0.00% 3.000us 3.000us 1
aten::empty 0.00% 2.000us 0.00% 2.000us 2.000us 1
aten::zero_ 0.00% 0.000us 0.00% 0.000us 0.000us 1
enumerate(DataLoader)#_SingleProcessDataLoaderIter._... 0.00% 5.000us 0.00% 5.000us 5.000us 1
aten::empty 0.00% 0.000us 0.00% 0.000us 0.000us 1
cudaGetDeviceCount 0.03% 499.000us 0.03% 499.000us 499.000us 1
cudaGetDeviceCount 0.00% 0.000us 0.00% 0.000us 0.000us 1
cudaGetDeviceProperties 0.01% 120.000us 0.01% 120.000us 120.000us 1
cudaGetDeviceProperties 0.01% 103.000us 0.01% 103.000us 103.000us 1
cudaDeviceSynchronize 99.95% 1.905s 99.95% 1.905s 1.905s 1
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 1.905s
Finally brain_or_pretrained.profiler.profiler outputs the following list:
[<FunctionEvent id=1 name=aten::zeros device_type=DeviceType.CPU node_id=-1 cpu_time=10.000us start_us=25 end_us=35 cpu_children=[2, 3] cuda_time=0.000us name=aten::zeros thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=2 name=aten::empty device_type=DeviceType.CPU node_id=-1 cpu_time=1.000us start_us=26 end_us=27 cpu_children=[] cuda_time=0.000us name=aten::empty thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=3 name=aten::zero_ device_type=DeviceType.CPU node_id=-1 cpu_time=1.000us start_us=34 end_us=35 cpu_children=[] cuda_time=0.000us name=aten::zero_ thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=4 name=ProfilerStep* device_type=DeviceType.CPU node_id=-1 cpu_time=152.000us start_us=40 end_us=192 cpu_children=[5, 6, 8, 9] cuda_time=0.000us name=ProfilerStep* thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=5 name=aten::empty device_type=DeviceType.CPU node_id=-1 cpu_time=1.000us start_us=41 end_us=42 cpu_children=[] cuda_time=0.000us name=aten::empty thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=6 name=aten::zeros device_type=DeviceType.CPU node_id=-1 cpu_time=3.000us start_us=46 end_us=49 cpu_children=[7] cuda_time=0.000us name=aten::zeros thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=7 name=aten::empty device_type=DeviceType.CPU node_id=-1 cpu_time=2.000us start_us=46 end_us=48 cpu_children=[] cuda_time=0.000us name=aten::empty thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=8 name=aten::zero_ device_type=DeviceType.CPU node_id=-1 cpu_time=0.000us start_us=49 end_us=49 cpu_children=[] cuda_time=0.000us name=aten::zero_ thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=9 name=enumerate(DataLoader)#_SingleProcessDataLoaderIter.__next__ device_type=DeviceType.CPU node_id=-1 cpu_time=5.000us start_us=51 end_us=56 cpu_children=[10] cuda_time=0.000us name=enumerate(DataLoader)#_SingleProcessDataLoaderIter.__next__ thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=10 name=aten::empty device_type=DeviceType.CPU node_id=-1 cpu_time=0.000us start_us=52 end_us=52 cpu_children=[] cuda_time=0.000us name=aten::empty thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=0 name=cudaGetDeviceCount device_type=DeviceType.CPU node_id=-1 cpu_time=499.000us start_us=226 end_us=725 cpu_children=[] cuda_time=0.000us name=cudaGetDeviceCount thread=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=0 name=cudaGetDeviceCount device_type=DeviceType.CPU node_id=-1 cpu_time=0.000us start_us=737 end_us=737 cpu_children=[] cuda_time=0.000us name=cudaGetDeviceCount thread=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 
is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=0 name=cudaGetDeviceProperties device_type=DeviceType.CPU node_id=-1 cpu_time=120.000us start_us=740 end_us=860 cpu_children=[] cuda_time=0.000us name=cudaGetDeviceProperties thread=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=0 name=cudaGetDeviceProperties device_type=DeviceType.CPU node_id=-1 cpu_time=103.000us start_us=931 end_us=1034 cpu_children=[] cuda_time=0.000us name=cudaGetDeviceProperties thread=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>, <FunctionEvent id=0 name=cudaDeviceSynchronize device_type=DeviceType.CPU node_id=-1 cpu_time=1.905s start_us=1081 end_us=1905643 cpu_children=[] cuda_time=0.000us name=cudaDeviceSynchronize thread=0 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>]
Hope it helps.
Yes, we need to filter out the cuda events...
cudaGetDeviceCount, cudaGetDeviceProperties, and cudaDeviceSynchronize
are not essential to profiling realtime & peak memory, although they take some time in this minimal example.
I have to think about what's better here - restricting the unit test to cpu-only, or filtering out the cuda preliminary checks (only for the test, or in general). Wondering if this is a pytorch/cuda bug or a feature of their profiling scheduler ;-)
The cudaDeviceSynchronize could also hint at an implementation issue (how we wrap the pytorch scheduler & profiler) - maybe profiler & model are on different devices in this scenario. The first setting (a) does virtually the same test, but without the issue; there, the @schedule & @profile decorators are used instead.
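If it helps to narrow this down, a hedged debugging check (illustrative, not part of the tests) is to print which activities the wrapped profiler was created with and on which device the model actually lives:
# Debugging aid (assumption-level): compare the profiler's activities with the model device.
print(brain_or_pretrained.profiler.activities)  # e.g. CPU and/or CUDA ProfilerActivity
print(next(brain_or_pretrained.modules.model.parameters()).device)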
Thanks @GAN-pie !
Exactly the same situation here.
@heunggyulee understood.
To not tamper with the profiler, I'll adjust the test - teaser:
# brain_or_pretrained.profiler.profiler will be set (not None anymore)
# when run on cpu, there are no events - but cuda activities are recorded if existing
# see: https://github.com/speechbrain/speechbrain/issues/1469
if torch.profiler.ProfilerActivity.CUDA in brain_or_pretrained.profiler.activities:
    assert (
        len(
            set([x.name for x in brain_or_pretrained.profiler.profiler.function_events])
            - {'cudaGetDeviceCount', 'cudaGetDeviceProperties', 'cudaDeviceSynchronize'}
        )
    ) == 0
else:
    assert len(brain_or_pretrained.profiler.events()) == 0
lmk
About the other one
FAILED tests/unittests/test_features.py::test_features_multimic[cuda] - AssertionError: assert tensor(0.0008, device='cuda:0') < 5e-05
the unit test looks like this:
from speechbrain.processing.features import Filterbank
compute_fbanks = Filterbank().to(device)
inputs = torch.rand([10, 101, 201], device=device)
output = compute_fbanks(inputs)
inputs_ch2 = torch.stack((inputs, inputs), -1)
output_ch2 = compute_fbanks(inputs_ch2)
output_ch2 = output_ch2[..., 0]
assert torch.sum(output - output_ch2) < 5e-05
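If helpful, the size of the mismatch can be inspected directly (a minimal debugging sketch, not part of the unit test), since a signed sum can hide or inflate elementwise differences:
# Debugging aid: look at the elementwise discrepancy instead of the signed sum.
diff = (output - output_ch2).abs()
print(diff.max(), diff.mean())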
What do you get for output and output_ch2?
Laptop cpu:
output
tensor([[[ -2.4998, -3.4625, -2.1935, ..., 7.6033, 7.8886, 7.4454],
[ -1.4686, -5.4756, -2.2520, ..., 7.5517, 7.8166, 6.8990],
[ -2.1951, -8.5281, -5.6952, ..., 7.9044, 7.4020, 7.1556],
...,
output_ch2
tensor([[[ -2.4998, -3.4625, -2.1935, ..., 7.6033, 7.8886, 7.4454],
[ -1.4686, -5.4756, -2.2520, ..., 7.5517, 7.8166, 6.8990],
[ -2.1951, -8.5281, -5.6952, ..., 7.9044, 7.4020, 7.1556],
...,
torch.sum(output - output_ch2)
tensor(4.9903e-06)
Closing this issue as the PR fixing this has been merged. Please feel free to re-open.