
Error in rnnt_loss_test_py

[Open] jtrmal opened this issue 2 years ago • 21 comments

$ CUDA_VISIBLE_DEVICES=0 ctest --rerun-failed --output-on-failure
Test project /home/jtrmal/projects/k2/build_debug
    Start 97: rnnt_loss_test_py
1/1 Test #97: rnnt_loss_test_py ................***Failed    3.42 sec
..F.Pruned with new ranges 2 : tensor([770.3518, 452.2017, 759.8420, 562.8477, 664.1088])
Pruned with old ranges 2 : tensor([770.2620, 451.0323, 758.0836, 564.5976, 664.1088])
Pruned with new ranges 7 : tensor([698.3528, 427.7394, 720.2049, 538.8741, 664.1088])
Pruned with old ranges 7 : tensor([695.5566, 427.7296, 719.7375, 534.9005, 664.1088])
Pruned with new ranges 12 : tensor([688.5497, 427.1318, 716.7771, 527.3295, 664.1088])
Pruned with old ranges 12 : tensor([688.5190, 427.1318, 716.7300, 527.1926, 664.1088])
Pruned with new ranges 17 : tensor([687.4325, 427.1087, 716.2193, 524.6537, 664.1088])
Pruned with old ranges 17 : tensor([687.4208, 427.1087, 716.2195, 524.6722, 664.1088])
Pruned with new ranges 2 : tensor([770.3518, 452.2017, 759.8420, 562.8477, 664.1088], device='cuda:0')
Pruned with old ranges 2 : tensor([770.2620, 451.0323, 758.0836, 564.5977, 664.1088], device='cuda:0')
Pruned with new ranges 7 : tensor([698.3528, 427.7394, 720.2049, 538.8741, 664.1088], device='cuda:0')
Pruned with old ranges 7 : tensor([695.5567, 427.7296, 719.7375, 534.9005, 664.1088], device='cuda:0')
Pruned with new ranges 12 : tensor([688.5497, 427.1318, 716.7771, 527.3295, 664.1088], device='cuda:0')
Pruned with old ranges 12 : tensor([688.5190, 427.1318, 716.7300, 527.1926, 664.1088], device='cuda:0')
Pruned with new ranges 17 : tensor([687.4325, 427.1087, 716.2193, 524.6537, 664.1088], device='cuda:0')
Pruned with old ranges 17 : tensor([687.4208, 427.1087, 716.2195, 524.6722, 664.1088], device='cuda:0')
Unpruned rnnt loss with regular rnnt : tensor([117.7035, 583.1506, 178.6128, 342.4715])
Pruned loss with range 2 : tensor([126.5516, 645.1305, 240.4490, 374.9182], dtype=torch.float64)
Pruned loss with range 7 : tensor([117.7035, 614.2900, 198.5655, 347.0386], dtype=torch.float64)
Pruned loss with range 12 : tensor([117.7035, 601.2673, 184.7332, 342.9748], dtype=torch.float64)
Pruned loss with range 17 : tensor([117.7035, 591.1936, 179.9721, 342.5152], dtype=torch.float64)
Pruned loss with range 22 : tensor([117.7035, 588.4237, 178.7730, 342.4716], dtype=torch.float64)
Pruned loss with range 27 : tensor([117.7035, 586.2511, 178.6456, 342.4716], dtype=torch.float64)
Pruned loss with range 32 : tensor([117.7035, 583.1505, 178.6393, 342.4716], dtype=torch.float64)
Pruned loss with range 37 : tensor([117.7035, 583.1505, 178.6138, 342.4716], dtype=torch.float64)
Pruned loss with range 42 : tensor([117.7035, 583.1505, 178.6129, 342.4716], dtype=torch.float64)
Pruned loss with range 47 : tensor([117.7035, 583.1505, 178.6129, 342.4716], dtype=torch.float64)
Unpruned rnnt loss with regular rnnt : tensor([117.7035, 583.1506, 178.6128, 342.4715], device='cuda:0')
Pruned loss with range 2 : tensor([126.5516, 645.1305, 240.4490, 374.9182], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 7 : tensor([117.7035, 614.2900, 198.5655, 347.0386], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 12 : tensor([117.7035, 601.2673, 184.7332, 342.9748], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 17 : tensor([117.7035, 591.1936, 179.9721, 342.5152], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 22 : tensor([117.7035, 588.4237, 178.7730, 342.4716], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 27 : tensor([117.7035, 586.2511, 178.6456, 342.4716], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 32 : tensor([117.7035, 583.1505, 178.6393, 342.4716], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 37 : tensor([117.7035, 583.1505, 178.6139, 342.4716], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 42 : tensor([117.7035, 583.1505, 178.6129, 342.4716], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 47 : tensor([117.7035, 583.1505, 178.6129, 342.4716], device='cuda:0',
       dtype=torch.float64)
Unpruned rnnt loss with modified rnnt : tensor([105.8454, 520.9167, 110.4065, 302.0487])
Pruned loss with range 2 : tensor([109.4327, 563.8055, 125.9352, 322.0668], dtype=torch.float64)
Pruned loss with range 7 : tensor([105.8454, 537.7171, 111.0337, 303.5203], dtype=torch.float64)
Pruned loss with range 12 : tensor([105.8454, 530.4149, 110.4277, 302.3482], dtype=torch.float64)
Pruned loss with range 17 : tensor([105.8454, 526.3236, 110.4066, 302.0535], dtype=torch.float64)
Pruned loss with range 22 : tensor([105.8454, 524.1243, 110.4065, 302.0488], dtype=torch.float64)
Pruned loss with range 27 : tensor([105.8454, 522.4050, 110.4065, 302.0488], dtype=torch.float64)
Pruned loss with range 32 : tensor([105.8454, 520.9166, 110.4065, 302.0488], dtype=torch.float64)
Pruned loss with range 37 : tensor([105.8454, 520.9166, 110.4065, 302.0488], dtype=torch.float64)
Pruned loss with range 42 : tensor([105.8454, 520.9166, 110.4065, 302.0488], dtype=torch.float64)
Pruned loss with range 47 : tensor([105.8454, 520.9166, 110.4065, 302.0488], dtype=torch.float64)
Unpruned rnnt loss with modified rnnt : tensor([105.8454, 520.9167, 110.4065, 302.0487], device='cuda:0')
Pruned loss with range 2 : tensor([109.4327, 563.8055, 125.9352, 322.0668], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 7 : tensor([105.8454, 537.7171, 111.0337, 303.5203], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 12 : tensor([105.8454, 530.4149, 110.4277, 302.3482], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 17 : tensor([105.8454, 526.3236, 110.4066, 302.0535], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 22 : tensor([105.8454, 524.1243, 110.4065, 302.0488], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 27 : tensor([105.8454, 522.4050, 110.4065, 302.0488], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 32 : tensor([105.8454, 520.9166, 110.4065, 302.0488], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 37 : tensor([105.8454, 520.9166, 110.4065, 302.0488], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 42 : tensor([105.8454, 520.9166, 110.4065, 302.0488], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 47 : tensor([105.8454, 520.9166, 110.4065, 302.0488], device='cuda:0',
       dtype=torch.float64)
Unpruned rnnt loss with constrained rnnt : tensor([118.3153, 590.9176, 210.3912, 346.9110])
Pruned loss with range 2 : tensor([125.0485, 637.4134, 236.0995, 368.7725], dtype=torch.float64)
Pruned loss with range 7 : tensor([118.3153, 610.0108, 211.3728, 348.9163], dtype=torch.float64)
Pruned loss with range 12 : tensor([118.3153, 602.3602, 210.4280, 347.3128], dtype=torch.float64)
Pruned loss with range 17 : tensor([118.3153, 596.3497, 210.3915, 346.9178], dtype=torch.float64)
Pruned loss with range 22 : tensor([118.3153, 594.3053, 210.3912, 346.9110], dtype=torch.float64)
Pruned loss with range 27 : tensor([118.3153, 592.7185, 210.3912, 346.9110], dtype=torch.float64)
Pruned loss with range 32 : tensor([118.3153, 590.9175, 210.3912, 346.9110], dtype=torch.float64)
Pruned loss with range 37 : tensor([118.3153, 590.9175, 210.3912, 346.9110], dtype=torch.float64)
Pruned loss with range 42 : tensor([118.3153, 590.9175, 210.3912, 346.9110], dtype=torch.float64)
Pruned loss with range 47 : tensor([118.3153, 590.9175, 210.3912, 346.9110], dtype=torch.float64)
Unpruned rnnt loss with constrained rnnt : tensor([118.3153, 590.9176, 210.3912, 346.9110], device='cuda:0')
Pruned loss with range 2 : tensor([125.0485, 637.4134, 236.0995, 368.7725], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 7 : tensor([118.3153, 610.0108, 211.3728, 348.9163], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 12 : tensor([118.3153, 602.3602, 210.4280, 347.3128], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 17 : tensor([118.3153, 596.3497, 210.3915, 346.9178], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 22 : tensor([118.3153, 594.3053, 210.3912, 346.9110], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 27 : tensor([118.3153, 592.7185, 210.3912, 346.9110], device='cuda:0',
       dtype=torch.float64)....
======================================================================
FAIL: test_rnnt_loss_empty_reference (__main__.TestRnntLoss)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jtrmal/projects/k2/k2/python/tests/rnnt_loss_test.py", line 844, in test_rnnt_loss_empty_reference
    assert torch.allclose(m, expected.to(device))
AssertionError

----------------------------------------------------------------------
Ran 8 tests in 2.342s

FAILED (failures=1)

Pruned loss with range 32 : tensor([118.3153, 590.9175, 210.3912, 346.9110], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 37 : tensor([118.3153, 590.9175, 210.3912, 346.9110], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 42 : tensor([118.3153, 590.9175, 210.3912, 346.9110], device='cuda:0',
       dtype=torch.float64)
Pruned loss with range 47 : tensor([118.3153, 590.9175, 210.3912, 346.9110], device='cuda:0',
       dtype=torch.float64)
B = 2, T = 9, S = 2, C = 10
Unpruned rnnt loss with regular rnnt : tensor([22.1890, 13.5834])
Pruned loss with range 2 : tensor([22.1890, 14.5212], dtype=torch.float64)
Pruned loss with range 3 : tensor([22.1890, 13.5834], dtype=torch.float64)
Unpruned rnnt loss with regular rnnt : tensor([22.1890, 13.5834], device='cuda:0')
Pruned loss with range 2 : tensor([22.1890, 14.5212], device='cuda:0', dtype=torch.float64)
Pruned loss with range 3 : tensor([22.1890, 13.5834], device='cuda:0', dtype=torch.float64)
Unpruned rnnt loss with modified rnnt : tensor([19.7059,  9.4256])
Pruned loss with range 1 : tensor([21.3703, 11.4501], dtype=torch.float64)
Pruned loss with range 2 : tensor([19.7059,  9.7360], dtype=torch.float64)
Pruned loss with range 3 : tensor([19.7059,  9.4256], dtype=torch.float64)
Unpruned rnnt loss with modified rnnt : tensor([19.7059,  9.4256], device='cuda:0')
Pruned loss with range 1 : tensor([21.3703, 11.4501], device='cuda:0', dtype=torch.float64)
Pruned loss with range 2 : tensor([19.7059,  9.7360], device='cuda:0', dtype=torch.float64)
Pruned loss with range 3 : tensor([19.7059,  9.4256], device='cuda:0', dtype=torch.float64)
Unpruned rnnt loss with constrained rnnt : tensor([22.1890, 13.9861])
Pruned loss with range 1 : tensor([inf, inf], dtype=torch.float64)
Pruned loss with range 2 : tensor([22.1890, 14.4814], dtype=torch.float64)
Pruned loss with range 3 : tensor([22.1890, 13.9861], dtype=torch.float64)
Unpruned rnnt loss with constrained rnnt : tensor([22.1890, 13.9861], device='cuda:0')
Pruned loss with range 1 : tensor([inf, inf], device='cuda:0', dtype=torch.float64)
Pruned loss with range 2 : tensor([22.1890, 14.4814], device='cuda:0', dtype=torch.float64)
Pruned loss with range 3 : tensor([22.1890, 13.9861], device='cuda:0', dtype=torch.float64)


0% tests passed, 1 tests failed out of 1

Total Test time (real) =   3.43 sec

The following tests FAILED:
	 97 - rnnt_loss_test_py (Failed)
Errors while running CTest

CUDA 11.7, cuDNN 8.7.0.84. Happens with both Release and Debug builds; k2 from git master. gcc: gcc (Debian 10.2.1-6) 10.2.1 20210110

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

any idea?

jtrmal · Feb 15 '23 21:02

And sorry for being terse

jtrmal · Feb 15 '23 21:02

Could you change

assert torch.allclose(m, expected.to(device))

to

assert torch.allclose(m, expected.to(device)), (m - expected.to(device)).abs().max()

so that it prints out some information on assertion failure.

If the value is very small, e.g., 0.001, we can ignore it.
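For reference, here is that check as a self-contained snippet; the tensor values below are made-up stand-ins just to show what the failure message looks like (in the test, m and expected are computed by the test itself):

import torch

device = torch.device("cpu")
m = torch.tensor([0.0])            # stand-in for the computed loss
expected = torch.tensor([1.1028])  # stand-in for the reference value

# On mismatch this raises AssertionError carrying the largest absolute
# difference, instead of a bare AssertionError with no context.
assert torch.allclose(m, expected.to(device)), (m - expected.to(device)).abs().max()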

csukuangfj · Feb 15 '23 22:02

BTW torch.testing.assert_close prints out better diagnostic info (how many elements mismatch, by how much, etc)

pzelasko · Feb 15 '23 23:02

Fangjun's code gave me this:

======================================================================
FAIL: test_rnnt_loss_empty_reference (__main__.TestRnntLoss)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jtrmal/projects/k2/k2/python/tests/rnnt_loss_test.py", line 845, in test_rnnt_loss_empty_reference
    assert torch.allclose(m, expected.to(device)), (
AssertionError: tensor(1.1028, device='cuda:0')

I also tried Piotr's suggestion, but that didn't provide any info; it only threw an AssertionError:

======================================================================
FAIL: test_rnnt_loss_empty_reference (__main__.TestRnntLoss)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jtrmal/projects/k2/k2/python/tests/rnnt_loss_test.py", line 848, in test_rnnt_loss_empty_reference
    assert torch.testing.assert_close(m, expected.to(device))
AssertionError

----------------------------------------------------------------------

The code was

assert torch.testing.assert_close(m, expected.to(device))

jtrmal · Feb 16 '23 14:02

I tried this code:

                assert torch.allclose(m, expected.to(device)), (
                    m,
                    expected,
                    m - expected.to(device),
                )
                

and the output was

======================================================================
FAIL: test_rnnt_loss_empty_reference (__main__.TestRnntLoss)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jtrmal/projects/k2/k2/python/tests/rnnt_loss_test.py", line 845, in test_rnnt_loss_empty_reference
    assert torch.allclose(m, expected.to(device)), (
AssertionError: (tensor([0.], device='cuda:0'), tensor([1.1028]), tensor([-1.1028], device='cuda:0'))

----------------------------------------------------------------------

jtrmal · Feb 16 '23 14:02

However, if I do something like this

850                 assert torch.testing.assert_close(
851                     m,
852                     expected.to(device),
853                     check_layout=False,
854                     check_device=False,
855                     check_dtype=False,
856                 ), (m, expected.to(device), (m - expected.to(device)))

I get this output

======================================================================
FAIL: test_rnnt_loss_empty_reference (__main__.TestRnntLoss)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jtrmal/projects/k2/k2/python/tests/rnnt_loss_test.py", line 850, in test_rnnt_loss_empty_reference
    assert torch.testing.assert_close(
AssertionError: (tensor([1.1028]), tensor([1.1028]), tensor([0.]))

----------------------------------------------------------------------
Ran 8 tests in 2.332s

I'm so confused

jtrmal · Feb 16 '23 14:02

Does it look like some timing/kernel sync issue?

jtrmal · Feb 16 '23 14:02

export K2_DISABLE_CHECKS=0
export K2_SYNC_KERNELS=1
export CUDA_LAUNCH_BLOCKING=1

didn't change the behavior, though.

jtrmal · Feb 16 '23 14:02

it did succeed on CPU, I think:

$ CUDA_VISIBLE_DEVICES= ctest --rerun-failed --output-on-failure
Test project /home/jtrmal/projects/k2/build_debug
    Start 97: rnnt_loss_test_py
1/1 Test #97: rnnt_loss_test_py ................   Passed    1.52 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) =   1.53 sec

jtrmal · Feb 16 '23 14:02

sorry for spamming :/

jtrmal · Feb 16 '23 14:02

Please change assert torch.testing.assert_close( to just torch.testing.assert_close(; the actual assertion happens inside that call, and then you'll see more info.
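A minimal sketch of what that means, using made-up tensors (torch.testing.assert_close returns None on success and raises internally on mismatch, so wrapping it in assert hides its diagnostics and fails even when the tensors match):

import torch

m = torch.tensor([1.1028])
expected = torch.tensor([1.1028])

# Wrong: assert_close passes, returns None, and `assert None` then fails
# with a bare AssertionError and no diagnostics.
# assert torch.testing.assert_close(m, expected)

# Right: call it directly; on a real mismatch it raises AssertionError with
# the mismatch count and the greatest absolute/relative differences.
torch.testing.assert_close(m, expected)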

pzelasko · Feb 16 '23 15:02

ah!

jtrmal · Feb 16 '23 15:02

======================================================================
FAIL: test_rnnt_loss_empty_reference (__main__.TestRnntLoss)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jtrmal/projects/k2/k2/python/tests/rnnt_loss_test.py", line 850, in test_rnnt_loss_empty_reference
    torch.testing.assert_close(
  File "/home/jtrmal/.local/lib/python3.9/site-packages/torch/testing/_comparison.py", line 1342, in assert_close
    assert_equal(
  File "/home/jtrmal/.local/lib/python3.9/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 1 (100.0%)
Greatest absolute difference: 1.1028387546539307 at index (0,) (up to 1e-05 allowed)
Greatest relative difference: 1.0 at index (0,) (up to 1.3e-06 allowed)

----------------------------------------------------------------------
Ran 8 tests in 4.031s

FAILED (failures=1)

jtrmal · Feb 16 '23 15:02

Oops, not helpful with a single-element tensor 🙈

pzelasko · Feb 16 '23 15:02

Tested cuDNN 8.3, 8.6, and 8.8 and could reproduce on all three.

jtrmal · Feb 16 '23 18:02

We only added the ability to have an empty reference fairly recently, so it's possible it was never properly tested then. Looking at that code, "expected" only seems to be written to if the device is CPU. [EDIT: I see now that this is how it is supposed to work; it is in a loop over devices.]

danpovey · Feb 17 '23 08:02

If you have time, one thing you could do to help debug is this: in mutual_information.py, line 393, after the following line

    # note, tot_probs is without grad.                                                                                                                                                                                     
    tot_probs = _k2.mutual_information_forward(px_tot, py_tot, boundary, p)

print out the value of p (p will get set by this function call). This may generate a lot of output before it crashes; direct to a file if you want.
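For concreteness, a sketch of that edit (the two existing lines are quoted from mutual_information.py above; only the print is new):

    # note, tot_probs is without grad.
    tot_probs = _k2.mutual_information_forward(px_tot, py_tot, boundary, p)
    print("p =", p)  # debugging: p is filled in by the forward call above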

danpovey · Feb 17 '23 08:02

It's a 3090. Will get in touch with Desh to see if he can dig deeper than I can.

On Fri, Feb 17, 2023 at 3:15 AM Daniel Povey wrote:

What is your hardware?


jtrmal · Feb 17 '23 13:02

@jtrmal I just tried running on CLSP grid with GPU and it passes:

10:31 $ pytest k2/python/tests/rnnt_loss_test.py
====================================================================== test session starts =======================================================================
platform linux -- Python 3.8.12, pytest-5.4.3, py-1.11.0, pluggy-0.13.1
rootdir: /export/c07/draj/mini_scale_2022/k2
plugins: typeguard-2.13.3, anyio-3.5.0, hypothesis-5.41.2
collected 8 items                                                                                                                                                

k2/python/tests/rnnt_loss_test.py ........                                                                                                                 [100%]

======================================================================== warnings summary ========================================================================
/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/_pytest/stepwise.py:108
  /home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/_pytest/stepwise.py:108: PytestCacheWarning: cache could not write path /export/c07/draj/mini_scale_2022/k2/.pytest_cache/v/cache/stepwise
    self.config.cache.set("cache/stepwise", [])

/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/_pytest/cacheprovider.py:366
  /home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/_pytest/cacheprovider.py:366: PytestCacheWarning: cache could not write path /export/c07/draj/mini_scale_2022/k2/.pytest_cache/v/cache/nodeids
    config.cache.set("cache/nodeids", self.cached_nodeids)

/home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/_pytest/cacheprovider.py:326
  /home/draj/anaconda3/envs/scale/lib/python3.8/site-packages/_pytest/cacheprovider.py:326: PytestCacheWarning: cache could not write path /export/c07/draj/mini_scale_2022/k2/.pytest_cache/v/cache/lastfailed
    config.cache.set("cache/lastfailed", self.lastfailed)

-- Docs: https://docs.pytest.org/en/latest/warnings.html
================================================================= 8 passed, 3 warnings in 41.40s =================================================================

(You can ignore the warnings --- the c07 node is read-only today due to some issues.)

desh2608 · Feb 17 '23 15:02

OK, might be something specific to Yenda's setup or where he is running it. @jtrmal can you please add that print statement that I mentioned above? (I edited it, you may not see it from email)

danpovey · Feb 18 '23 05:02

Attachments: cpu.log, gpu.log

I'm attaching logs from both cpu and gpu runs, obtained as

CUDA_VISIBLE_DEVICES=  ctest --rerun-failed       --verbose > cpu.log
CUDA_VISIBLE_DEVICES=0 ctest --rerun-failed       --verbose > gpu.log

The CPU run succeeded, the GPU run failed. Also, I had to modify the mutual_information.py file at line 160 instead; the modification at line 397 didn't give any output (probably joint_mutual_information_recursion isn't called).

jtrmal · Feb 21 '23 15:02