Yanming W.

21 comments of Yanming W.

@cicirori Thanks for bringing up this interesting topic! Can you share more details about what code changes you made to reduce the compilation time by about 30%? One method I found...

@ang868 I saw very similar patterns before. XLA implements these RNGs using other XLA ops (e.g. [here](https://github.com/tensorflow/tensorflow/blob/490246643af1043ead0f5584158c062a63004012/tensorflow/compiler/xla/client/lib/prng.cc#L46-L49)) and the generated cuda kernel may not be as performant as those pytorch...

I think it's possible to use torch-xla with C++. An example is the C++ gtest programs in `test/cpp/`. You may write a custom CMake file similar to https://github.com/pytorch/xla/blob/master/test/cpp/CMakeLists.txt to compile...

Interesting, the test was supposed to pass and it passed in my local environment. I thought the issue had been fixed by https://github.com/pytorch/pytorch/pull/82189 and my PR https://github.com/pytorch/pytorch/pull/82010 is not required....

It looks like the previous CI failure is due to eager debug mode. So to get around the `run_eager_debug` issue, I created a new file. This PR is ready for...

@JackCaoG This one is ready.

You may want to check out why these env variables are not returned correctly in your environment here https://github.com/pytorch/xla/blob/371bdf462a6b7576e294ae39e5ba23d8509d7834/test/cpp/run_tests.sh#L55-L56 For my environment, I'm getting `PYTHON_INCLUDE_DIR=/home/ubuntu/anaconda3/include/python3.8` `PYTHON_LIBRARY=/home/ubuntu/anaconda3/lib/libpython3.8.so`
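As a debugging aid, the values those shell lines derive can be approximated from Python's own `sysconfig` module. This is a sketch for comparison only (it is not the exact logic `run_tests.sh` uses, and the variable names mirror the script's output informally):

```python
import sysconfig

# Header directory for the active interpreter (roughly PYTHON_INCLUDE_DIR).
include_dir = sysconfig.get_paths()["include"]

# Library directory and shared-library name (roughly PYTHON_LIBRARY's parts).
# LDLIBRARY may be a static archive or None on some builds.
libdir = sysconfig.get_config_var("LIBDIR")
ldlibrary = sysconfig.get_config_var("LDLIBRARY")

print(f"PYTHON_INCLUDE_DIR={include_dir}")
print(f"PYTHON_LIBRARY={libdir}/{ldlibrary}")
```

If this prints sensible paths but the script's variables come back empty, the problem is likely in how the script invokes Python rather than in the interpreter itself.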

@cicirori We have a PR to fix this issue https://github.com/tensorflow/tensorflow/pull/57108. If it works for you, please let me know.

@JackCaoG I think we may consider making it the default by setting `XLA_FLAGS="--xla_gpu_force_compilation_parallelism=?"` [here](https://github.com/pytorch/xla/blob/1b0d4feca391303bcfe2846bc198b5e89f8f72d4/torch_xla/__init__.py#L48). Enabling multi-threading for compilation did significantly improve the GPU user experience by reducing >50% of the compilation...
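A minimal sketch of what that default could look like, assuming the flag is prepended to `XLA_FLAGS` before XLA initializes; the parallelism value `8` here is purely illustrative (the right default is the open question the `?` above stands for), and the snippet preserves any flags the user already set:

```python
import os

FLAG = "--xla_gpu_force_compilation_parallelism"

# Only inject the flag if the user hasn't already set it themselves.
existing = os.environ.get("XLA_FLAGS", "")
if FLAG not in existing:
    os.environ["XLA_FLAGS"] = f"{existing} {FLAG}=8".strip()
```

Respecting a user-provided value avoids silently overriding an explicit choice, which is the usual convention for env-var-based defaults in `torch_xla/__init__.py`.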

FYI, the XLA algebraic optimizer can remove this round trip to f32 during lowering, so it shouldn't affect performance. Update: I only verified this on GPU and this may become...
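The reason such a round trip is removable (assuming it is a bf16 → f32 → bf16 convert pair): every bf16 value is exactly representable in f32, so the pair of converts is an identity and the optimizer can elide it. A small pure-Python illustration using bit-level truncation (a simplification; real hardware rounds rather than truncates, but the identity property is the same):

```python
import struct

def to_bf16_bits(x: float) -> int:
    """Truncate an f32 value to its bf16 bit pattern (top 16 bits)."""
    f32_bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return f32_bits >> 16

def bf16_to_f32(bits: int) -> float:
    """Widen a bf16 bit pattern to f32: exact, no information lost."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

x_bits = to_bf16_bits(1.5)
# bf16 -> f32 -> bf16 reproduces the original bits exactly,
# which is why the convert pair can be removed during lowering.
roundtrip_bits = to_bf16_bits(bf16_to_f32(x_bits))
```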