
k2::TopSorter::TopSort assertion, but only when using GPU

rouseabout opened this issue 1 year ago • 22 comments

Using the icefall/egs/librispeech/ASR/pruned_transducer_stateless7 recipe, with only train-clean-5 and dev-clean-2 used for training, running pruned_transducer_stateless7/decode.py on GPU with --decoding-method fast_beam_search_nbest_LG produces the following error.

[F] /home/user/k2/k2/csrc/top_sort.cu:324:k2::FsaVec k2::TopSorter::TopSort(k2::Array1<int>*) Check failed: start_state_present[0] == 1 (0 vs. 1) Our current implementation requires that the start state in each Fsa must be present in the first batch

However, when pruned_transducer_stateless7/decode.py is forced to use the CPU, fast_beam_search_nbest_LG runs successfully.
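
For reference, forcing the CPU run was done by editing decode.py; a minimal sketch of that kind of override (assumed, not necessarily the exact edit; the device-selection pattern is the usual icefall one):

import torch

# Pin the device to CPU instead of picking CUDA when it is available.
device = torch.device("cpu")
# Original selection, commented out for the CPU run:
# if torch.cuda.is_available():
#     device = torch.device("cuda", 0)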

Any suggestions what I might be doing wrong?

Image: nvcr.io/nvidia/pytorch:23.04-py3

k2.version:

k2 version: 1.24.3
Build type: Release
Git SHA1: fdb76bf4b3d9f28699eaf854b6b54e015b6b8a62
Git date: Wed May 24 23:51:07 2023
Cuda used to build k2: 12.1
cuDNN used to build k2: 
Python version used to build k2: 3.8
OS used to build k2: 
CMake version: 3.24.1
GCC version: 9.4.0
CMAKE_CUDA_FLAGS:  -Wno-deprecated-gpu-targets   -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_50,code=sm_50  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_60,code=sm_60  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_61,code=sm_61  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_70,code=sm_70  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_75,code=sm_75  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_80,code=sm_80  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_86,code=sm_86 -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=integer_sign_change,--diag_suppress=useless_using_declaration,--diag_suppress=set_but_not_used,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=implicit_return_from_non_void_function,--diag_suppress=unsigned_compare_with_zero,--diag_suppress=declared_but_not_referenced,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -D_GLIBCXX_USE_CXX11_ABI=1 --compiler-options -Wall  --compiler-options -Wno-strict-overflow  --compiler-options -Wno-unknown-pragmas 
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=1 -Wno-unused-variable  -Wno-strict-overflow 
PyTorch version used to build k2: 2.1.0a0+fe05266
PyTorch is using Cuda: 12.1
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False
Max cpu memory allocate: 214748364800 bytes (or 200.0 GB)
k2 abort: False
__file__: /home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/version/version.py
_k2.__file__: /home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so 

rouseabout avatar May 28 '23 00:05 rouseabout

@rouseabout

Are you using the latest icefall and have you made any changes to the code? Also, how did you generate LG.pt?

csukuangfj avatar May 30 '23 03:05 csukuangfj

Icefall: Latest. https://github.com/k2-fsa/icefall/commit/1aeffa73bce3d4803ec52f0d17287ff65e280430

Code changes: Minimal to make it download and run on my limited hardware. Diff: https://github.com/rouseabout/icefall/commit/9e23b3826403c1e196b5249b74d56391d68a8760

  • prepare.sh: download librispeech mini instead of full
  • prepare.sh: use only dev-clean-2 and train-clean-5 dataset parts
  • prepare.sh: comment out download musan, prepare musan and compute fbank musan
  • prepare.sh: comment out building G_4_gram.fst.txt (it is not used by ./local/compile_lg.py)
  • pruned_transducer_stateless7/asr_datamodule.py: default --mini-libri to True and default --enable-musan to False
  • pruned_transducer_stateless7/decode.py: test against dev_clean_2_cuts() only

./prepare.sh was used to populate the ./data folder and build ./data/lang_bpe_500/LG.pt. The LG.pt model appears to work on CPU.

-rw-r--r-- 1 user user 1226762 May 26 11:07 data/lang_bpe_500/LG.pt
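
A minimal sketch of how LG.pt can be loaded and inspected on CPU (the path is taken from above; the check is just torch.load plus k2.Fsa.from_dict):

import torch
import k2

# Load the LG decoding graph on CPU and print its basic properties.
LG = k2.Fsa.from_dict(torch.load("data/lang_bpe_500/LG.pt", map_location="cpu"))
print(LG.shape, LG.num_arcs)
print(LG.properties_str)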

rouseabout avatar May 30 '23 04:05 rouseabout

Can you make sure to run whatever tests are available in k2? Sorry, I don't recall the details of how. It could be a bug in k2 that is specific to your GPU or CPU hardware.

danpovey avatar May 30 '23 06:05 danpovey

Guys, especially @pkufool: I noticed an issue in top_sort.cu. The comment for GetInitialBatch() says:

  /*
    Return the ragged array containing the states active on the 1st iteration of
    the algorithm.  These just correspond to the start-states of all
    the FSAs, and also the final-states for all FSAs in which final-states
    had in-degree zero (no arcs entering them).

    Note: in the originally published algorithm we start with all states
    that have in-degree zero, but in the context of this toolkit there
    is (I believe) no use in states that aren't accessible from the start
    state, so we remove them.
   */

but the actual code does not do this; it just gets the states with in-degree 0, as in the originally published algorithm. This is probably OK (although I think we should change the comment to just say:

  /*
    Return the ragged array containing the states active on the 1st iteration of
    the algorithm.  These just correspond to all states
    that have in-degree zero.
   */

Notice that if the start-state has in-degree >0 (this is after removing self-loops), the start-state will not be included in the 1st batch. This is consistent with the documentation of TopSort. We must be careful to ensure that the input is acyclic. Can someone ensure that fast_beam_search_nbest_LG would always give acyclic input? (Note: the 1st/start state must have an arc coming into it for this assertion to come up, I think).
To avoid cycles, it may be necessary to add something to the "state" that is the number of symbols we have seen on this frame. I'm assuming right now that the "state" consists of: [state_in_LG, current_frame]; we could augment it to [state_in_LG, current_frame, num_syms_seen_this_frame].
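
A hypothetical sketch of that augmented state (illustration only; the names are invented, not actual k2/icefall code). Adding the per-frame symbol count makes two expansions of the same LG state on the same frame distinct states, so arcs between them cannot close a cycle:

from typing import NamedTuple

class SearchState(NamedTuple):
    state_in_LG: int           # state id in the LG decoding graph
    current_frame: int         # frame index t
    num_syms_this_frame: int   # number of symbols already emitted at frame t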

danpovey avatar May 30 '23 06:05 danpovey

OK, I will have a look.

pkufool avatar May 30 '23 07:05 pkufool

Also, @rouseabout, can you try running it in pdb and getting a Python stack trace when it fails? It would be nice to know for sure exactly when TopSort is being called.

danpovey avatar May 30 '23 09:05 danpovey

I suspect the top_sort is in https://github.com/k2-fsa/icefall/blob/7b0afbdc16066701759e088f7edbb648a0b879f0/egs/librispeech/ASR/pruned_transducer_stateless2/beam_search.py#L213

I paste the code here (the top_sort is on the 5th-to-last line). Can you dump the problematic lattice? You can do it with torch.save(lattice.as_dict(), file_name.pt).

lattice = fast_beam_search(
    model=model,
    decoding_graph=decoding_graph,
    encoder_out=encoder_out,
    encoder_out_lens=encoder_out_lens,
    beam=beam,
    max_states=max_states,
    max_contexts=max_contexts,
    temperature=temperature,
)

nbest = Nbest.from_lattice(
    lattice=lattice,
    num_paths=num_paths,
    use_double_scores=use_double_scores,
    nbest_scale=nbest_scale,
)

# The following code is modified from nbest.intersect()
word_fsa = k2.invert(nbest.fsa)
if hasattr(lattice, "aux_labels"):
    # delete token IDs as it is not needed
    del word_fsa.aux_labels
word_fsa.scores.zero_()
word_fsa_with_epsilon_loops = k2.linear_fsa_with_self_loops(word_fsa)
path_to_utt_map = nbest.shape.row_ids(1)

if hasattr(lattice, "aux_labels"):
    # lattice has token IDs as labels and word IDs as aux_labels.
    # inv_lattice has word IDs as labels and token IDs as aux_labels
    inv_lattice = k2.invert(lattice)
    inv_lattice = k2.arc_sort(inv_lattice)
else:
    inv_lattice = k2.arc_sort(lattice)

if inv_lattice.shape[0] == 1:
    path_lattice = k2.intersect_device(
        inv_lattice,
        word_fsa_with_epsilon_loops,
        b_to_a_map=torch.zeros_like(path_to_utt_map),
        sorted_match_a=True,
    )
else:
    path_lattice = k2.intersect_device(
        inv_lattice,
        word_fsa_with_epsilon_loops,
        b_to_a_map=path_to_utt_map,
        sorted_match_a=True,
    )

# path_lattice has word IDs as labels and token IDs as aux_labels
path_lattice = k2.top_sort(k2.connect(path_lattice))
tot_scores = path_lattice.get_tot_scores(
    use_double_scores=use_double_scores,
    log_semiring=True,  # Note: we always use True
)
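
For completeness, dumping the lattice as suggested above and reloading it later for offline debugging could look like this (a minimal sketch; the file name is a placeholder):

import torch
import k2

# Save the suspect lattice as a plain dict of tensors ...
torch.save(lattice.as_dict(), "lattice.pt")

# ... and reload it later as an Fsa for offline inspection.
lattice = k2.Fsa.from_dict(torch.load("lattice.pt"))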

pkufool avatar May 30 '23 09:05 pkufool

Thanks for looking into this.

Quick note: setup.py disables building the C++ tests. I suggest changing this.

extra_cmake_args += " -DK2_ENABLE_TESTS=OFF "
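
Presumably the rebuild just flips that flag; a sketch of the edit (the surrounding setup.py code may differ across k2 versions):

# In k2's setup.py, enable the C++ tests before rebuilding/reinstalling:
extra_cmake_args += " -DK2_ENABLE_TESTS=ON "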

After rebuilding, I can see 2 C++ tests are failing. All Python tests are passing.

user@0411f528f430:~/k2/build/temp.linux-x86_64-3.8$ ctest
Test project /home/user/k2/build/temp.linux-x86_64-3.8
        Start   1: Test.Cuda.cu_algorithms_test
  1/111 Test   #1: Test.Cuda.cu_algorithms_test .................   Passed    1.62 sec
[...]
110/111 Test #110: Test.Cuda.cu_k2_torch_wave_reader_test .......   Passed    0.56 sec
        Start 111: Test.torch_api_test
111/111 Test #111: Test.torch_api_test ..........................   Passed    0.57 sec

98% tests passed, 2 tests failed out of 111

Total Test time (real) = 349.05 sec

The following tests FAILED:
         10 - Test.Cuda.cu_hash_test (Failed)
        109 - Test.Cuda.cu_k2_torch_parse_options_test (Failed)
Errors while running CTest
Output from these tests are in: /home/user/k2/build/temp.linux-x86_64-3.8/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
user@0411f528f430:~/k2$ pytest k2/python/tests
============================================= test session starts =============================================
platform linux -- Python 3.8.10, pytest-7.3.1, pluggy-1.0.0
rootdir: /home/user/k2
plugins: typeguard-4.0.0, xdist-3.2.1, shard-0.1.2, hypothesis-5.35.1, xdoctest-1.0.2, rerunfailures-11.1.2
collected 233 items                                                                                           
Running 233 items in this shard
[...]
============================================ 233 passed in 45.95s =============================================

Hardware: NVIDIA Corporation GP104GL [Tesla P4] (rev a1), Driver Version: 530.41.03, CUDA Version: 12.1

Stack trace from ./pruned_transducer_stateless7/decode.py:

[ Stack-Trace: ]
/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/lib/libk2_log.so(k2::internal::GetStackTrace[abi:cxx11]()+0x58) [0x7f94b9486538]
/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/lib/libk2context.so(k2::internal::Logger::~Logger()+0x5a) [0x7f94b9b3ac3a]
/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/lib/libk2context.so(k2::TopSorter::TopSort(k2::Array1<int>*)+0x46a) [0x7f94b9f286da]
/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/lib/libk2context.so(k2::TopSort(k2::Ragged<k2::Arc>&, k2::Ragged<k2::Arc>*, k2::Array1<int>*)+0x14b) [0x7f94b9f19f2b]
/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so(+0x8d504) [0x7f94bf9b8504]
/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so(+0x42737) [0x7f94bf96d737]
python3(PyCFunction_Call+0x59) [0x5f6489]
python3(_PyObject_MakeTpCall+0x296) [0x5f7056]
python3(_PyEval_EvalFrameDefault+0x62d2) [0x5715a2]
python3(_PyFunction_Vectorcall+0x1b6) [0x5f6836]
python3(_PyEval_EvalFrameDefault+0x57f2) [0x570ac2]
python3(_PyEval_EvalCodeWithName+0x26a) [0x569cea]
python3(_PyFunction_Vectorcall+0x393) [0x5f6a13]
python3(_PyEval_EvalFrameDefault+0x1901) [0x56cbd1]
python3(_PyEval_EvalCodeWithName+0x26a) [0x569cea]
python3(_PyFunction_Vectorcall+0x393) [0x5f6a13]
python3(_PyEval_EvalFrameDefault+0x1901) [0x56cbd1]
python3(_PyEval_EvalCodeWithName+0x26a) [0x569cea]
python3(_PyFunction_Vectorcall+0x393) [0x5f6a13]
python3(_PyEval_EvalFrameDefault+0x1901) [0x56cbd1]
python3(_PyFunction_Vectorcall+0x1b6) [0x5f6836]
python3(PyObject_Call+0x62) [0x5f5c02]
python3(_PyEval_EvalFrameDefault+0x1f2c) [0x56d1fc]
python3(_PyEval_EvalCodeWithName+0x26a) [0x569cea]
python3(_PyFunction_Vectorcall+0x393) [0x5f6a13]
python3(_PyEval_EvalFrameDefault+0x72d) [0x56b9fd]
python3(_PyEval_EvalCodeWithName+0x26a) [0x569cea]
python3(PyEval_EvalCode+0x27) [0x68e7b7]
python3() [0x680001]
python3() [0x68007f]
python3() [0x680121]
python3(PyRun_SimpleFileExFlags+0x197) [0x680db7]
python3(Py_RunMain+0x212) [0x6b8122]
python3(Py_BytesMain+0x2d) [0x6b84ad]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f958ec7d083]
python3(_start+0x2e) [0x5fb39e]

and the Python error message:

Traceback (most recent call last):
  File "./pruned_transducer_stateless7/decode.py", line 972, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "./pruned_transducer_stateless7/decode.py", line 950, in main
    results_dict = decode_dataset(
  File "./pruned_transducer_stateless7/decode.py", line 656, in decode_dataset
    hyps_dict = decode_one_batch(
  File "./pruned_transducer_stateless7/decode.py", line 479, in decode_one_batch
    hyp_tokens = fast_beam_search_nbest_LG(
  File "/home/user/icefall/egs/atcosim/ASR/pruned_transducer_stateless7/beam_search.py", line 213, in fast_beam_search_nbest_LG
    path_lattice = k2.top_sort(k2.connect(path_lattice))
  File "/home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230526+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/fsa_algo.py", line 244, in top_sort
    ragged_arc, arc_map = _k2.top_sort(fsa.arcs, need_arc_map=need_arc_map)
RuntimeError: 
    Some bad things happened. Please read the above error messages and stack
    trace. If you are using Python, the following command may be helpful:

      gdb --args python /path/to/your/code.py

    (You can use `gdb` to debug the code. Please consider compiling
    a debug version of k2.).

    If you are unable to fix it, please open an issue at:

      https://github.com/k2-fsa/k2/issues/new

rouseabout avatar May 30 '23 09:05 rouseabout

Thanks! Can you rerun the tests with the --rerun-failed --output-on-failure options as it mentions? That would have been a CTest command; I'm not sure which directory it would have been run in.

danpovey avatar May 30 '23 10:05 danpovey

ctest --rerun-failed --output-on-failure 2>&1 | tee /tmp/log.txt

https://pross.sdf.org/sandpit/log.txt (467 KiB)

> I paste the code here (the top_sort is on the 5th-to-last line). Can you dump the problematic lattice? You can do it with torch.save(lattice.as_dict(), file_name.pt).

https://pross.sdf.org/sandpit/path_lattice.pt (355 MiB)

I will delete these files in a few days. Cheers.

rouseabout avatar May 31 '23 01:05 rouseabout

Thanks! I am debugging it and will post the results here once available.

pkufool avatar May 31 '23 01:05 pkufool

@rouseabout Could you also dump the lattice from fast_beam_search? Thank you very much!

lattice = fast_beam_search(
    model=model,
    decoding_graph=decoding_graph,
    encoder_out=encoder_out,
    encoder_out_lens=encoder_out_lens,
    beam=beam,
    max_states=max_states,
    max_contexts=max_contexts,
    temperature=temperature,
)

pkufool avatar May 31 '23 01:05 pkufool

https://pross.sdf.org/sandpit/lattice.pt (6.7M)

Observation: The contents of path_lattice.pt change each time I run decode.py (the md5sum changes), whereas the content of lattice.pt is always the same. I expected both to be deterministic.
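
A quick way to check this kind of determinism (a sketch, not from the thread; file names are placeholders) is to hash the dumps from two separate runs:

import hashlib

def md5sum(path: str) -> str:
    """Return the MD5 hex digest of a file."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# Compare dumps produced by two separate decode.py runs.
print(md5sum("path_lattice_run1.pt"))
print(md5sum("path_lattice_run2.pt"))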

rouseabout avatar May 31 '23 02:05 rouseabout

> https://pross.sdf.org/sandpit/lattice.pt (6.7M)
>
> Observation: The contents of path_lattice.pt change each time I run decode.py (the md5sum changes), whereas the content of lattice.pt is always the same. I expected both to be deterministic.

Thank you!

~~Yes, because the nbest is randomly sampled from the lattice, so the path_lattice may change.~~

Edit: Sorry, I was wrong; the paths are not randomly sampled, see https://k2-fsa.github.io/k2/python_api/api.html#random-paths. So this might be another issue.
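
A small sketch of what the linked API implies (file name and num_paths are placeholders): calling k2.random_paths twice on the same lattice should give identical paths, since the paths are not actually randomly sampled.

import torch
import k2

lattice = k2.Fsa.from_dict(torch.load("lattice.pt"))

# Per the linked docs, the path selection is deterministic for a fixed input,
# so both calls should return the same ragged array of arc indexes.
paths_a = k2.random_paths(lattice, use_double_scores=True, num_paths=200)
paths_b = k2.random_paths(lattice, use_double_scores=True, num_paths=200)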

pkufool avatar May 31 '23 03:05 pkufool

@rouseabout Sorry for the slow reply; I cannot reproduce the error with the lattices you provided. The properties of path_lattice.pt and lattice.pt are: [screenshot of the lattices' shapes and properties_str omitted]

I also tried creating path_lattice from lattice, and the top_sort runs normally.

from icefall.decode import Nbest

lattice = k2.Fsa.from_dict(torch.load("/star-kw/kangwei/issues/k2_1204/lattice.pt"))
lattice = lattice.to("cuda:4")

nbest = Nbest.from_lattice(
    lattice=lattice,
    num_paths=200,
    use_double_scores=True,
    nbest_scale=0.5,
)

# The following code is modified from nbest.intersect()
word_fsa = k2.invert(nbest.fsa)
if hasattr(lattice, "aux_labels"):
    # delete token IDs as it is not needed
    del word_fsa.aux_labels
word_fsa.scores.zero_()
word_fsa_with_epsilon_loops = k2.linear_fsa_with_self_loops(word_fsa)
path_to_utt_map = nbest.shape.row_ids(1)

if hasattr(lattice, "aux_labels"):
    # lattice has token IDs as labels and word IDs as aux_labels.
    # inv_lattice has word IDs as labels and token IDs as aux_labels
    inv_lattice = k2.invert(lattice)
    inv_lattice = k2.arc_sort(inv_lattice)
else:
    inv_lattice = k2.arc_sort(lattice)

if inv_lattice.shape[0] == 1:
    path_lattice = k2.intersect_device(
        inv_lattice,
        word_fsa_with_epsilon_loops,
        b_to_a_map=torch.zeros_like(path_to_utt_map),
        sorted_match_a=True,
    )
else:
    path_lattice = k2.intersect_device(
        inv_lattice,
        word_fsa_with_epsilon_loops,
        b_to_a_map=path_to_utt_map,
        sorted_match_a=True,
    )
[screenshot of the remaining notebook cells omitted]

Could you check that the lattice you dumped is the problematic one? Thank you very much!

pkufool avatar Jun 02 '23 04:06 pkufool

@pkufool Really appreciate you looking into this. It is not urgent.

I can confirm the lattice.pt and path_lattice.pt were output from ./pruned_transducer_stateless7/decode.py --decoding-method fast_beam_search_nbest_LG, and that it crashed at top_sort.cu:324.

When I run your notebook lines, I observe the same shape and properties_str output.

When I run your code, changing cuda:4 to cuda:0, it runs normally, no crash...

HOWEVER, your code is missing the line from fast_beam_search_nbest_LG() that invokes top_sort:

path_lattice = k2.top_sort(k2.connect(path_lattice))

After adding this line to your code, it crashes at top_sort.cu:324.

What GPU are you testing on?

rouseabout avatar Jun 02 '23 07:06 rouseabout

> HOWEVER, your code is missing the line from fast_beam_search_nbest_LG() that invokes top_sort:

See cell 14.

> What GPU are you testing on?

I tested it on an NVIDIA V100 (PyTorch version 1.8.1, CUDA version 10.2).

pkufool avatar Jun 02 '23 07:06 pkufool

Before invoking top_sort, path_lattice already has the property TopSortedAndAcyclic, so I think the TopSort algorithm should not do anything; the crash is odd. So, what are your k2, PyTorch, and CUDA versions?
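
A small sketch of that sanity check on the dumped lattice (the device index and file path are placeholders):

import torch
import k2

path_lattice = k2.Fsa.from_dict(torch.load("path_lattice.pt"))
path_lattice = path_lattice.to("cuda:0")

# "TopSortedAndAcyclic" should already appear among the flags here, in which
# case top_sort is expected to be a no-op rather than hit the assertion.
print(path_lattice.properties_str)
path_lattice = k2.top_sort(k2.connect(path_lattice))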

pkufool avatar Jun 02 '23 07:06 pkufool

Oops, I missed cell 14 :(

k2 version: 1.24.3
Build type: Release
Git SHA1: 1a76309e5c6343c4d18965b7ce134a7f311d9d3a
Git date: Sun May 28 06:04:03 2023
Cuda used to build k2: 12.1
cuDNN used to build k2: 
Python version used to build k2: 3.8
OS used to build k2: 
CMake version: 3.24.1
GCC version: 9.4.0
CMAKE_CUDA_FLAGS:  -Wno-deprecated-gpu-targets   -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_50,code=sm_50  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_60,code=sm_60  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_61,code=sm_61  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_70,code=sm_70  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_75,code=sm_75  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_80,code=sm_80  -lineinfo --expt-extended-lambda -use_fast_math -Xptxas=-w  --expt-extended-lambda -gencode arch=compute_86,code=sm_86 -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=integer_sign_change,--diag_suppress=useless_using_declaration,--diag_suppress=set_but_not_used,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=implicit_return_from_non_void_function,--diag_suppress=unsigned_compare_with_zero,--diag_suppress=declared_but_not_referenced,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -D_GLIBCXX_USE_CXX11_ABI=1 --compiler-options -Wall  --compiler-options -Wno-strict-overflow  --compiler-options -Wno-unknown-pragmas 
CMAKE_CXX_FLAGS:  -D_GLIBCXX_USE_CXX11_ABI=1 -Wno-unused-variable  -Wno-strict-overflow 
PyTorch version used to build k2: 2.1.0a0+fe05266
PyTorch is using Cuda: 12.1
NVTX enabled: True
With CUDA: True
Disable debug: True
Sync kernels : False
Disable checks: False
Max cpu memory allocate: 214748364800 bytes (or 200.0 GB)
k2 abort: False
__file__: /home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230530+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/k2/version/version.py
_k2.__file__: /home/user/.local/lib/python3.8/site-packages/k2-1.24.3.dev20230530+cuda12.1.torch2.1.0a0-py3.8-linux-x86_64.egg/_k2.cpython-38-x86_64-linux-gnu.so

I am using this Docker image (https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-23-04.html). I will try an older image.

user@2dfad2bc2655:~$ pip list | grep ^torch
torch                   2.1.0a0+fe05266
torch-tensorrt          1.4.0.dev0
torchaudio              2.1.0a0+6425d46                          /home/user/audio
torchtext               0.13.0a0+fae8e8c
torchvision             0.15.0a0
user@2dfad2bc2655:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

rouseabout avatar Jun 02 '23 08:06 rouseabout

Results:

8GB Tesla P4:

Container                          PyTorch            CUDA     Status
nvcr.io/nvidia/pytorch:22.05-py3   1.12.0a0+8a1a93a   11.7.0   WORKING
nvcr.io/nvidia/pytorch:22.12-py3   1.14.0a0+410ce96   11.8.0   WORKING
nvcr.io/nvidia/pytorch:23.02-py3   1.14.0a0+44dac51   12.0.1   CRASH
nvcr.io/nvidia/pytorch:23.04-py3   2.1.0a0+fe05266f   12.1.0   CRASH

16GB Tesla T4:

Container                          PyTorch            CUDA     Status
nvcr.io/nvidia/pytorch:22.12-py3   1.14.0a0+410ce96   11.8.0   WORKING
nvcr.io/nvidia/pytorch:23.02-py3   1.14.0a0+44dac51   12.0.1   CRASH

Software/hardware configurations were otherwise identical. While it's only a few data points, one might conclude that k2 + CUDA 12.x has problems.

rouseabout avatar Jun 05 '23 09:06 rouseabout

8GB Tesla P4:

Container                          PyTorch   CUDA     Status
nvcr.io/nvidia/pytorch:23.05-py3   2.0.0     12.1.1   CRASH

rouseabout avatar Jun 05 '23 21:06 rouseabout

Thanks! We will debug it on CUDA 12.x.

pkufool avatar Jun 06 '23 11:06 pkufool