## 🐛 Bug Parallel compilation ~~on PJRT:GPU~~ slows down execution performance. ~~(Probably not PJRT-specific; I haven't tested the performance of parallel compilation on XRT.)~~ (Just tested on XRT,...
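Not certain this is the right knob, but XLA's `--xla_gpu_force_compilation_parallelism` flag appears to control GPU compilation parallelism; below is a minimal timing sketch for comparing serial vs. parallel compilation. The flag choice, shapes, and setup are my assumptions, not from the original report.

```python
# Sketch: time the first step (compile + run) and a steady-state step,
# once with --xla_gpu_force_compilation_parallelism=1 (serial) and once
# with a larger value, to check whether execution itself regresses.
import os
os.environ["XLA_FLAGS"] = "--xla_gpu_force_compilation_parallelism=1"  # set before importing torch_xla

import time
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.randn(1024, 1024, device=device)

def timed_step():
    start = time.time()
    y = (x @ x).sum()
    xm.mark_step()          # cut the graph: compile (first time) + execute
    xm.wait_device_ops()    # block until the device is idle
    return time.time() - start

print(f"first step (compile + run): {timed_step():.3f}s")
print(f"steady-state step:          {timed_step():.3f}s")
```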
## ❓ Questions and Help Recently I started testing GC performance on GPU with the master versions of pytorch and torch xla. Unfortunately, consistent with my previous conclusions (https://github.com/pytorch/xla/issues/3455#issuecomment-1101056839),...
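Assuming "GC" here means gradient checkpointing, a minimal sketch of the kind of A/B measurement being described, using the stock `torch.utils.checkpoint` (the model, the shapes, and the choice of the upstream checkpoint rather than a torch_xla-specific variant are all my assumptions):

```python
# Sketch: compare a step with and without gradient checkpointing on XLA:GPU.
import time
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
import torch_xla.core.xla_model as xm

device = xm.xla_device()
block = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).to(device)
x = torch.randn(64, 4096, device=device, requires_grad=True)

def step(use_gc):
    start = time.time()
    out = checkpoint(block, x, use_reentrant=False) if use_gc else block(x)
    out.sum().backward()
    xm.mark_step()          # trigger compilation/execution of the step
    xm.wait_device_ops()
    return time.time() - start

for use_gc in (False, True):
    step(use_gc)  # warm-up step triggers compilation
    print(f"use_gc={use_gc}: {step(use_gc):.3f}s")
```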
## ❓ Questions and Help https://github.com/pytorch/xla/blob/master/torch_xla/csrc/init_python_bindings.cpp#L349 I was wondering whether XLA AllReduce() already supports the standard xla token. I'm trying to overlap the computation and communication of torch xla in...
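For context, a sketch of the overlap being attempted, via the public `xm.all_reduce` API (shapes are made up; whether the lowered AllReduce threads an xla token, and hence whether the compiler may actually run it concurrently with the independent matmul, is exactly the question above):

```python
# Sketch: issue an all-reduce, then compute something with no data
# dependence on its result; overlap is only possible if the collective
# is not artificially ordered against the rest of the graph.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
grads = torch.randn(1024, 1024, device=device)
activations = torch.randn(1024, 1024, device=device)

reduced = xm.all_reduce(xm.REDUCE_SUM, grads)  # communication
independent = activations @ activations        # computation, independent of `reduced`
xm.mark_step()
```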
## ❓ Questions and Help I am trying to use torch xla to train a model with 1.3B parameters. However, it takes more than two hours to compile the model....
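One way to see where those two hours go is torch_xla's debug metrics and graph dump; a sketch is below (the env var names are from torch_xla's troubleshooting docs; the output path is a placeholder):

```python
# Sketch: dump the HLO graphs being compiled and print compile metrics,
# to check graph size and whether recompilation happens repeatedly.
import os
os.environ["XLA_SAVE_TENSORS_FILE"] = "/tmp/xla_graphs.txt"  # set before importing torch_xla
os.environ["XLA_SAVE_TENSORS_FMT"] = "hlo"

import torch_xla.debug.metrics as met

# ... build the model and run one training step ...

print(met.metrics_report())  # inspect CompileTime (count and total) and ExecuteTime
```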
### Required prerequisites
- [X] I have read the documentation.
- [X] I have searched the [Issue Tracker](https://github.com/baichuan-inc/baichuan-7B/issues) and [Discussions](https://github.com/baichuan-inc/baichuan-7B/discussions) and confirmed that this hasn't already been reported. (+1 or comment...
Wondering why the ncclInt8 datatype is used in the C++ implementation of nccl_all_to_all_scatter_async: is it for speed reasons, or simply to avoid supporting multiple datatypes through templates? Thanks!
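My guess at the answer (an assumption, not confirmed anywhere in the thread): all-to-all only moves data and performs no arithmetic on it, so a buffer of any dtype can be transferred as raw bytes with the element count scaled by the element size, which makes ncclInt8 a dtype-agnostic choice that avoids templating. The same reinterpretation trick, sketched in PyTorch:

```python
# Sketch: a buffer of any dtype can be viewed as int8 bytes
# (count = numel * element_size) and reinterpreted back after transfer,
# since a pure data-movement collective never touches the values.
import torch

x = torch.randn(4, 8)                     # float32 payload
as_bytes = x.view(torch.int8)             # same storage, numel * 4 int8 elements
assert as_bytes.numel() == x.numel() * x.element_size()
roundtrip = as_bytes.view(torch.float32)  # reinterpret after the (hypothetical) transfer
assert torch.equal(roundtrip, x)
```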