Siyuan Liu

18 comments by Siyuan Liu

I cannot find a way to read `pytorch/.github/ci_commit_pins` from `test/tpu/xla_test_job.yaml`. When running `pip install "git+https://github.com/pytorch/vision.git@$TORCHVISION_COMMIT"`, `$TORCHVISION_COMMIT` ends up as an empty string. To rule out the fact that we cannot change...
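A minimal sketch of how a workflow step could read the pin and install at that commit, assuming the pytorch repo is checked out alongside the job and that the pin file is named `vision.txt` (the file name is an assumption; the real layout may differ):

```yaml
# Hypothetical GHA step: read the torchvision commit pin from a pytorch
# checkout and install torchvision at that commit. Falls back loudly if
# the pin file is missing, instead of silently installing "@" (empty pin).
- name: Install torchvision at the pinned commit
  run: |
    TORCHVISION_COMMIT="$(cat pytorch/.github/ci_commit_pins/vision.txt || true)"
    if [ -z "$TORCHVISION_COMMIT" ]; then
      echo "torchvision commit pin not found" >&2
      exit 1
    fi
    pip install "git+https://github.com/pytorch/vision.git@${TORCHVISION_COMMIT}"
```

Failing fast on an empty pin avoids the empty-string `$TORCHVISION_COMMIT` problem described above, where `pip` would be asked to install `...git@` with no ref.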

> Can you also mirror the fix to the GHA CI if it works? https://github.com/pytorch/xla/blob/master/.github/workflows/tpu_ci.yml

The commit pin will be more readily available. I had a few attempts...

> https://github.com/pytorch/xla/blob/master/.github/workflows/tpu_ci.yml

The GHA TPU CI is fixed in https://github.com/pytorch/xla/pull/6730.

Hi @Nullkooland, to clarify the request: there is no `erf` op in StableHLO. Do you expect a bunch of decomposed StableHLO ops for the `erf` op in the exported...
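To illustrate what such a decomposition could look like numerically, here is a sketch of one standard rational approximation of `erf` (Abramowitz & Stegun 7.1.26) built only from primitives StableHLO does have (multiply, add, divide, exponential, abs/sign). This is an illustration of the idea, not the actual decomposition the exporter emits:

```python
import math

# erf approximated with ops available in StableHLO (mul, add, div, exp,
# abs/sign) via the Abramowitz & Stegun 7.1.26 polynomial.
# Maximum absolute error is about 1.5e-7 over the real line.
def erf_decomposed(x: float) -> float:
    sign = 1.0 if x >= 0 else -1.0
    x = abs(x)
    p = 0.3275911
    a1, a2, a3, a4, a5 = (0.254829592, -0.284496736,
                          1.421413741, -1.453152027, 1.061405429)
    t = 1.0 / (1.0 + p * x)
    # Horner evaluation of the degree-5 polynomial in t.
    poly = ((((a5 * t + a4) * t + a3) * t + a2) * t + a1) * t
    return sign * (1.0 - poly * math.exp(-x * x))
```

Since every operation here maps to an existing StableHLO op, a lowering like this is one way an `erf` call can survive export without a dedicated `erf` op.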

Hit the following error:

```
File "/home/lsiyuan/.cache/bazel/_bazel_lsiyuan/9d8c0c9d904275861907f86bf4a21dbc/external/llvm-project/mlir/BUILD.bazel", line 40, column 7, in
        } | if_cuda_available(
Error: unsupported binary operation: dict | select
```

We need to upgrade the Bazel version to above...
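For context, a sketch of the Starlark pattern that triggers this error (the names here are illustrative, not copied from `mlir/BUILD.bazel`): merging a plain dict with the `select()` returned by a configuration helper using the `|` operator, which older Bazel releases reject.

```starlark
# Hypothetical BUILD.bazel fragment showing the failing shape:
# `dict | select` union is only accepted by newer Bazel versions;
# older ones fail with "unsupported binary operation: dict | select".
cc_library(
    name = "mlir_lib",  # illustrative target name
    defines = {
        "MLIR_BASE": "1",
    } | if_cuda_available(          # if_cuda_available returns a select()
        {"MLIR_CUDA": "1"},
        {},
    ),
)
```

The fix here is not a code change but a toolchain one, i.e. bumping the Bazel version used by the build, as noted above.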

Testing performance with the following command on a v4-8 TPU:

```
python test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1 --metrics_debug
```

After the pin update:

```
| Training Device=xla:0/1 Epoch=1 Step=2280 Loss=0.00135 Rate=425.92 GlobalRate=370.64 Time=23:07:35...
```
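To compare `Rate`/`GlobalRate` before and after a pin update, the progress lines can be parsed mechanically. A minimal sketch, assuming the exact `key=value` log format shown above (the parser itself is not part of the test script):

```python
import re

# Parse a test_train_mp_imagenet.py progress line of the form
# "| Training Device=xla:0/1 Epoch=1 Step=2280 Loss=0.00135 Rate=425.92 ..."
# into a dict of string fields keyed by the names before each "=".
def parse_metrics(line: str) -> dict:
    return dict(re.findall(r"(\w+)=(\S+)", line))

sample = ("| Training Device=xla:0/1 Epoch=1 Step=2280 "
          "Loss=0.00135 Rate=425.92 GlobalRate=370.64")
metrics = parse_metrics(sample)
# metrics["Rate"] and metrics["GlobalRate"] can then be converted with
# float() and averaged across steps for a before/after comparison.
```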

The PT2E test failed because the converter patch is currently commented out. The XLA pin can be moved again once https://github.com/pytorch/xla/blob/master/openxla_patches/quant_dequant_converter.diff is upstreamed.

The following GPU tests hit OOM in CI after the pin update:

```
PJRT_DEVICE=CUDA torchrun --nnodes=1 --node_rank=0 --nproc_per_node=2 test/test_train_mp_imagenet.py --fake_data --pjrt_distributed --batch_size=16 --num_epochs=1 --num_steps=25 --model=resnet18
PJRT_DEVICE=CUDA python test/test_train_mp_imagenet_fsdp.py --fake_data --auto_wrap_policy type_based...
```

cc @will-cromar for some PJRT changes to accommodate the change of the PJRT interface in upstream XLA.