Shaogang Wang
This PR uses `cuGraphInstantiateWithParams` instead of `cuGraphInstantiate` to instantiate CUDA graph executors, so the existing command_buffer_cmd_test and command_buffer_thunk_test should cover the changes in this PR.
I found that for the common training pattern in JAX, `new_state, other_output = jitted_train_step_fn(old_state, other_input)`, the current XLA runtime may assign different backing device memory buffers for `old_state` and `new_state`...
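For readers unfamiliar with the pattern, here is a minimal sketch of such a training loop in JAX. The toy `train_step` body, the state layout (`params`, `step`), and the learning rate are assumptions made up for illustration; only the call shape `new_state, other_output = jitted_train_step_fn(old_state, other_input)` comes from the description above.

```python
import jax
import jax.numpy as jnp


def loss_fn(params, batch):
    # Hypothetical toy loss, used only to give the step something to optimize.
    return jnp.mean((params * batch - 1.0) ** 2)


def train_step(old_state, other_input):
    # One gradient-descent update; returns the new state and an extra output.
    grads = jax.grad(loss_fn)(old_state["params"], other_input)
    new_state = {
        "params": old_state["params"] - 0.1 * grads,
        "step": old_state["step"] + 1,
    }
    other_output = loss_fn(new_state["params"], other_input)
    return new_state, other_output


jitted_train_step_fn = jax.jit(train_step)

state = {"params": jnp.ones((4,)), "step": jnp.array(0)}
batch = jnp.full((4,), 2.0)

# The output state of each call becomes the input state of the next call.
for _ in range(3):
    state, loss = jitted_train_step_fn(state, batch)
```

Because the state is threaded through every iteration, where the runtime places the `old_state` and `new_state` buffers affects every step of the loop.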
NCCL 2.23.4 changed the API definition of `NVTX_PAYLOAD_EVTATTR_SET`. This API is not stable, so calling it externally should be avoided.