Chenggang Zhao
An easy way to ignore that is to add diagnostic pragmas around the entire implementation:
```c++
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
...
#pragma GCC diagnostic pop
```
...
+1 with this issue
@HIT-cwh @SolenoidWGT One fix is to write a custom `CacheManager` and select it by setting `os.environ["TRITON_CACHE_MANAGER"] = '...'`. Reference: https://github.com/openai/triton/blob/main/python/triton/runtime/cache.py. In this manager, we only put the files from rank 0 and make...
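To illustrate only the "put the files from rank 0" part of that idea, here is a minimal sketch. It is not Triton's `CacheManager` interface: the helper name `put_if_rank0`, its parameters, and the write-then-rename step are all assumptions made for illustration, and the real manager would have to implement whatever methods the referenced `cache.py` expects.

```python
import os
import tempfile


def put_if_rank0(cache_dir: str, filename: str, data: bytes, rank: int) -> str:
    """Hypothetical helper: write a cache file only when called from rank 0.

    Returns the final path in every case, so all ranks agree on where the
    file lives even though only rank 0 creates it.
    """
    path = os.path.join(cache_dir, filename)
    if rank != 0:
        # Non-zero ranks never write, which avoids concurrent writers
        # racing on the same file.
        return path

    os.makedirs(cache_dir, exist_ok=True)
    # Assumption: write to a temporary file in the same directory, then
    # rename it into place so readers never observe a partial file.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp_path, path)
    return path
```

In a real job the rank would typically come from the launcher (for example a `RANK` environment variable), which is also an assumption here.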
> @HIT-cwh Hi, I met the same issue and resolved it by setting `TRITON_CACHE_DIR` to local storage instead of shared storage.
>
> The root cause in my...
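The workaround quoted above can be sketched as follows. `TRITON_CACHE_DIR` comes from the quoted comment; the helper name `use_local_triton_cache`, the per-rank subdirectory, and the use of the system temp directory as "local storage" are assumptions for illustration and may not match your cluster layout.

```python
import os
import tempfile
from typing import Optional


def use_local_triton_cache(rank: int, base_dir: Optional[str] = None) -> str:
    """Hypothetical helper: point TRITON_CACHE_DIR at a local directory.

    Assumption: tempfile.gettempdir() is on node-local storage; pass
    base_dir explicitly if it is not.
    """
    base = base_dir or tempfile.gettempdir()
    # Assumption: one subdirectory per rank, so ranks on the same node
    # do not share cache files either.
    cache_dir = os.path.join(base, f"triton_cache_rank{rank}")
    os.makedirs(cache_dir, exist_ok=True)
    os.environ["TRITON_CACHE_DIR"] = cache_dir
    return cache_dir


# Set the variable early, before any Triton kernel is compiled.
cache_dir = use_local_triton_cache(int(os.environ.get("RANK", "0")))
```

Setting the variable in the launch script instead of in Python should work equally well, as long as it is in place before the first compilation.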
The commands for macOS and Windows are somewhat different. For details, see the English tutorial: https://os.phil-opp.com/freestanding-rust-binary/
I guess it is related to the Python, pybind11, or compiler setup in your environment; see https://github.com/pybind/pybind11/issues/3623.
> For the PTX instruction `ld.global.nc.L1::no_allocate.L2::256B` you mentioned, on devices where global memory is cached in L1 by default, such as Volta to Blackwell (sm70+), it's equivalent to `ld.global.L1::no_allocate.L2::256B`. Maybe...
Got it, and thanks very much for your detailed explanation! I will fix the related code later (towards semantic correctness).
> Is synchronization across all ranks needed before dispatching SEND/RECV operations?
> Is synchronization across all ranks needed after dispatching SEND/RECV operations?
> Is synchronization across all ranks needed before...
For example, the only place where dispatch waits for data arrival is here: https://github.com/deepseek-ai/DeepEP/blob/main/csrc/kernels/internode_ll.cu#L492.