Shaogang Wang

Results 7 issues of Shaogang Wang

This PR uses `cuGraphInstantiateWithParams` instead of `cuGraphInstantiate` to instantiate CUDA graph executors, so the existing `command_buffer_cmd_test` and `command_buffer_thunk_test` should cover the changes in this PR.

I found that for the common training pattern in JAX, `new_state, other_output = jitted_train_step_fn(old_state, other_input)`, the current XLA runtime may assign different backing device memory buffers for `old_state` and `new_state`...
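The pattern described above can be sketched roughly as follows. This is an illustration only, not code from the issue: the body of `train_step`, the array shapes, and the learning rate are hypothetical, and the pointer comparison simply makes the observation visible on a single-device array. Whether the two addresses differ depends on the backend and the JAX/XLA version.

```python
import jax
import jax.numpy as jnp


def train_step(state, batch):
    # Hypothetical update rule, used only to produce a new state
    # with the same shape and dtype as the old one.
    new_state = state - 0.1 * jnp.mean(batch)
    other_output = jnp.sum(new_state ** 2)
    return new_state, other_output


jitted_train_step_fn = jax.jit(train_step)

old_state = jnp.ones((4,))
other_input = jnp.arange(4.0)

# Address of the device buffer backing the input state.
old_ptr = old_state.unsafe_buffer_pointer()

new_state, other_output = jitted_train_step_fn(old_state, other_input)

# Address of the device buffer backing the output state.
new_ptr = new_state.unsafe_buffer_pointer()

print("old_state and new_state share a buffer:", old_ptr == new_ptr)
```

Because `old_state` is still alive after the call here, the runtime has to place `new_state` in a separate allocation, so the two addresses will normally differ in this sketch.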

enhancement

NCCL 2.23.4 changed the definition of `NVTX_PAYLOAD_EVTATTR_SET`. This API is not stable, so external code should avoid calling it.

awaiting review
comp:xla
size:XS