wang

11 comments by wang

Yeah, if the workload is submitted as a RayJob, using an ephemeral Ray cluster, we can add the custom label at creation time, as sketched below. But if we have a long-running Ray cluster, that...
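A hedged sketch of the RayJob case, assuming the KubeRay RayJob CRD; the `team` label and all names here are illustrative, not from the original thread:

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: sample-job
  labels:
    team: my-team              # hypothetical custom label
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  rayClusterSpec:              # the ephemeral cluster created for this job
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          labels:
            team: my-team      # propagated to the ephemeral cluster's pods
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
```

Because the cluster is created per job, the label is applied from the start; for a long-running cluster the labels would have to be patched onto already-existing resources instead.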

If we want to implement this, could it be achieved using RaySyncer? The syncer would trigger NodeManager to run ConsumeSyncMessage -> UpdateResourceUsage -> UpdateNode -> ClusterResourceManager.AddOrUpdateNode, eventually updating the NodeResources data structure...

https://kuttl.dev/docs/#pre-requisites Maybe we could consider using this tool?

> @Irvingwangjr Kuttl looks cool! I read the README. It seems suitable for testing some sample YAMLs, but it doesn't seem to work well for the case where we need to...

https://github.com/open-feature/open-feature-operator OpenFeature (a CNCF project) adopted this tool, and it also provides some examples.
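For a flavor of what a kuttl case looks like: a minimal sketch, assuming kuttl's convention of numbered step and assert files per test-case directory; the RayCluster resource and its status field are illustrative stand-ins:

```yaml
# tests/e2e/raycluster/00-install.yaml -- applied as the first test step
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: test-cluster
spec: {}  # real spec elided; illustrative only
---
# tests/e2e/raycluster/00-assert.yaml -- kuttl polls until this state matches
apiVersion: kuttl.dev/v1beta1
kind: TestAssert
timeout: 120
---
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: test-cluster
status:
  state: ready  # hypothetical status value to wait for
```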

Many thanks for this! I also want to report a bug here; reproduction script:

```python
def test_offloading_works_with_cpu_tensors() -> None:
    class SomefuncNeedCpuTensors(torch.autograd.Function):
        @staticmethod
        def forward(ctx, cpu_tensor):
            assert cpu_tensor.device == torch.device("cpu")
            ctx.save_for_backward(cpu_tensor)
            return ...
```
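The script above is cut off mid-function. A self-contained sketch of the same reproduction idea follows, assuming torchtune's `OffloadActivations` context manager; the `use_streams=False` argument, the doubling op, and the tensor size are my own illustrative choices, not the original script:

```python
import torch
from torchtune.training import OffloadActivations


def test_offloading_works_with_cpu_tensors() -> None:
    class SomefuncNeedCpuTensors(torch.autograd.Function):
        @staticmethod
        def forward(ctx, cpu_tensor):
            # The op only works with CPU tensors, mirroring real ops such as
            # grouped_gemm's batch_sizes parameter (linked below).
            assert cpu_tensor.device == torch.device("cpu")
            ctx.save_for_backward(cpu_tensor)
            return cpu_tensor * 2

        @staticmethod
        def backward(ctx, grad_output):
            (cpu_tensor,) = ctx.saved_tensors
            # If offloading round-trips the saved tensor through the GPU,
            # it comes back on the wrong device and this assert fires.
            assert cpu_tensor.device == torch.device("cpu")
            return grad_output * 2

    # Large enough that the offloading hooks do not skip it as "too small".
    inp = torch.ones(4096, requires_grad=True)
    with OffloadActivations(use_streams=False):
        out = SomefuncNeedCpuTensors.apply(inp)
    out.sum().backward()
```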

https://github.com/tgale96/grouped_gemm This op is a real-world example: the GMM op takes a parameter named batch_sizes, and that parameter needs to be a CPU tensor.
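A rough sketch of the call pattern in question, assuming the `gg.ops.gmm` entry point from that repo; shapes, sizes, and dtypes are illustrative:

```python
import torch
import grouped_gemm as gg  # https://github.com/tgale96/grouped_gemm

# Three groups of rows in `a`, one weight matrix per group in `b`.
batch_sizes = torch.tensor([2, 4, 2], dtype=torch.int64)  # must stay on CPU
a = torch.randn(8, 16, dtype=torch.bfloat16, device="cuda")
b = torch.randn(3, 16, 32, dtype=torch.bfloat16, device="cuda")

# gmm reads batch_sizes on the host to split `a` into per-group GEMMs, so a
# pass (like activation offloading) that moves it onto CUDA breaks this contract.
out = gg.ops.gmm(a, b, batch_sizes)  # -> shape (8, 32)
```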

> Thanks [@Irvingwangjr](https://github.com/Irvingwangjr), I asked [@janeyx99](https://github.com/janeyx99) if she has availability to take a look.
>
> The way we use it in torchtune is that we only enable it...

> [@Irvingwangjr](https://github.com/Irvingwangjr) Ah, good point. Would it then make more sense to only offload if the tensor is not on CPU?

Yeah, I actually patched the code like this: ...
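The patch body is cut off above; what follows is a hypothetical minimal sketch of the idea using `torch.autograd.graph.saved_tensors_hooks`, not the actual torchtune change (torchtune's offloading adds streams and pinned memory on top of this):

```python
import torch


def pack_hook(t: torch.Tensor):
    # Remember the original device; tensors already on CPU skip the round trip.
    if t.device.type == "cpu":
        return t.device, t
    return t.device, t.to("cpu", non_blocking=True)


def unpack_hook(packed):
    device, t = packed
    # Only tensors that were actually offloaded get moved back.
    return t if device.type == "cpu" else t.to(device, non_blocking=True)


# Usage: activations saved during this forward pass go through the hooks.
x = torch.randn(4, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
    y = (x * x).sum()  # x is saved for backward and passes through pack_hook
y.backward()
```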

> [@Irvingwangjr](https://github.com/Irvingwangjr) If convenient, can you check that this patch [#2466](https://github.com/pytorch/torchtune/pull/2466) does the trick? I am specializing on CUDA here because our streaming logic only works in CUDA.
>
> ...