## 🐛 Bug Parallel compilation ~~on PJRT:GPU~~ slows down execution performance. ~~(Probably not PJRT-specific; I haven't tested the performance of parallel compilation on XRT.)~~ (Just tested on XRT,...
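Not certain this is the right knob, but XLA's `--xla_gpu_force_compilation_parallelism` flag appears to control GPU compilation parallelism; below is a minimal timing sketch for comparing serial vs. parallel compilation. The flag choice, shapes, and setup are my assumptions, not from the original report.

```python
# Sketch: time the first step (compile + run) and a steady-state step,
# once with --xla_gpu_force_compilation_parallelism=1 (serial) and once
# with a larger value, to check whether execution itself regresses.
import os
os.environ["XLA_FLAGS"] = "--xla_gpu_force_compilation_parallelism=1"  # set before importing torch_xla

import time
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(1024, 1024, device=device)

def timed_step():
    start = time.time()
    y = (x @ x).sum()
    xm.mark_step()          # cut the graph: compile (first time) + execute
    xm.wait_device_ops()    # block until the device is idle
    return time.time() - start

print(f"first step (compile + run): {timed_step():.3f}s")
print(f"steady-state step:          {timed_step():.3f}s")
```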
## ❓ Questions and Help Recently I started testing GC performance on GPU with the master versions of pytorch and torch xla. Unfortunately, consistent with my previous conclusions (https://github.com/pytorch/xla/issues/3455#issuecomment-1101056839),...
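Assuming "GC" here means gradient checkpointing, a minimal sketch of the kind of A/B measurement being described, using the stock `torch.utils.checkpoint` (the model, the shapes, and the choice of the upstream checkpoint rather than a torch_xla-specific variant are all my assumptions):

```python
# Sketch: compare a step with and without gradient checkpointing on XLA:GPU.
import time
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
import torch_xla.core.xla_model as xm

device = xm.xla_device()
block = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).to(device)
x = torch.randn(64, 4096, device=device, requires_grad=True)

def step(use_gc):
    start = time.time()
    out = checkpoint(block, x, use_reentrant=False) if use_gc else block(x)
    out.sum().backward()
    xm.mark_step()          # trigger compilation/execution of the step
    xm.wait_device_ops()
    return time.time() - start

for use_gc in (False, True):
    step(use_gc)  # warm-up step triggers compilation
    print(f"use_gc={use_gc}: {step(use_gc):.3f}s")
```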
## ❓ Questions and Help https://github.com/pytorch/xla/blob/master/torch_xla/csrc/init_python_bindings.cpp#L349 I was wondering whether XLA AllReduce() already supports the standard xla token. I'm trying to overlap the computation and communication of torch xla in...
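For context, a sketch of the overlap being attempted, via the public `xm.all_reduce` API (shapes are made up; whether the lowered AllReduce threads an xla token, and hence whether the compiler may actually run it concurrently with the independent matmul, is exactly the question above):

```python
# Sketch: issue an all-reduce, then compute something with no data
# dependence on its result; overlap is only possible if the collective
# is not artificially ordered against the rest of the graph.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
grads = torch.randn(1024, 1024, device=device)
activations = torch.randn(1024, 1024, device=device)

reduced = xm.all_reduce(xm.REDUCE_SUM, grads)  # communication
independent = activations @ activations        # computation, independent of `reduced`
xm.mark_step()
```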
## ❓ Questions and Help I am trying to use torch xla to train a model with 1.3B parameters. However, it takes more than two hours to compile the model....
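One way to see where those two hours go is torch_xla's debug metrics and graph dump; a sketch is below (the env var names are from torch_xla's troubleshooting docs; the output path is a placeholder):

```python
# Sketch: dump the HLO graphs being compiled and print compile metrics,
# to check graph size and whether recompilation happens repeatedly.
import os
os.environ["XLA_SAVE_TENSORS_FILE"] = "/tmp/xla_graphs.txt"  # set before importing torch_xla
os.environ["XLA_SAVE_TENSORS_FMT"] = "hlo"

import torch_xla.debug.metrics as met

# ... build the model and run one training step ...

print(met.metrics_report())  # inspect CompileTime (count and total) and ExecuteTime
```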
### Required prerequisites
- [X] I have read the documentation.
- [X] I have searched the [Issue Tracker](https://github.com/baichuan-inc/baichuan-7B/issues) and [Discussions](https://github.com/baichuan-inc/baichuan-7B/discussions) and confirmed that this hasn't already been reported. (+1 or comment...
Wondering why the ncclInt8 datatype is used in the C++ implementation of nccl_all_to_all_scatter_async: is it for speed reasons, or simply to avoid supporting multiple datatypes through templates? Thanks!
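My guess at the answer (an assumption, not confirmed anywhere in the thread): all-to-all only moves data and performs no arithmetic on it, so a buffer of any dtype can be transferred as raw bytes with the element count scaled by the element size, which makes ncclInt8 a dtype-agnostic choice that avoids templating. The same reinterpretation trick, sketched in PyTorch:

```python
# Sketch: a buffer of any dtype can be viewed as int8 bytes
# (count = numel * element_size) and reinterpreted back after transfer,
# since a pure data-movement collective never touches the values.
import torch

x = torch.randn(4, 8)                     # float32 payload
as_bytes = x.view(torch.int8)             # same storage, numel * 4 int8 elements
assert as_bytes.numel() == x.numel() * x.element_size()
roundtrip = as_bytes.view(torch.float32)  # reinterpret after the (hypothetical) transfer
assert torch.equal(roundtrip, x)
```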