wang

11 comments by wang

Yeah, if the workload is submitted as a RayJob, using an ephemeral Ray cluster, we can add the custom label at creation time, as sketched below. But if we have a long-running Ray cluster, that...
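A hedged sketch of the RayJob case, assuming the KubeRay RayJob CRD; the `team` label and all names here are illustrative, not from the original thread:

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: sample-job
  labels:
    team: my-team              # hypothetical custom label
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  rayClusterSpec:              # the ephemeral cluster created for this job
    headGroupSpec:
      rayStartParams: {}
      template:
        metadata:
          labels:
            team: my-team      # propagated to the ephemeral cluster's pods
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
```

Because the cluster is created per job, the label is applied from the start; for a long-running cluster the labels would have to be patched onto already-existing resources instead.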

If we want to implement this, could it be achieved using RaySyncer? The syncer would trigger NodeManager to run ConsumeSyncMessage -> UpdateResourceUsage -> UpdateNode -> ClusterResourceManager.AddOrUpdateNode, eventually updating the NodeResources data structure...

https://kuttl.dev/docs/#pre-requisites Maybe we could consider using this tool?

> @Irvingwangjr Kuttl looks cool! I read the README. It seems suitable for testing some sample YAMLs, but it doesn't seem to work well for the case where we need to...

https://github.com/open-feature/open-feature-operator OpenFeature (a CNCF project) adopted this tool, and it also provides some examples.
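For a flavor of what a kuttl case looks like: a minimal sketch, assuming kuttl's convention of numbered step and assert files per test-case directory; the RayCluster resource and its status field are illustrative stand-ins:

```yaml
# tests/e2e/raycluster/00-install.yaml -- applied as the first test step
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: test-cluster
spec: {}  # real spec elided; illustrative only
---
# tests/e2e/raycluster/00-assert.yaml -- kuttl polls until this state matches
apiVersion: kuttl.dev/v1beta1
kind: TestAssert
timeout: 120
---
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: test-cluster
status:
  state: ready  # hypothetical status value to wait for
```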

Many thanks for this! I also want to report a bug here; reproduction script:

```python
def test_offloading_works_with_cpu_tensors() -> None:
    class SomefuncNeedCpuTensors(torch.autograd.Function):
        @staticmethod
        def forward(ctx, cpu_tensor):
            assert cpu_tensor.device == torch.device("cpu")
            ctx.save_for_backward(cpu_tensor)
            return ...
```
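The script above is cut off mid-function. A self-contained sketch of the same reproduction idea follows, assuming torchtune's `OffloadActivations` context manager; the `use_streams=False` argument, the doubling op, and the tensor size are my own illustrative choices, not the original script:

```python
import torch
from torchtune.training import OffloadActivations


def test_offloading_works_with_cpu_tensors() -> None:
    class SomefuncNeedCpuTensors(torch.autograd.Function):
        @staticmethod
        def forward(ctx, cpu_tensor):
            # The op only works with CPU tensors, mirroring real ops such as
            # grouped_gemm's batch_sizes parameter (linked below).
            assert cpu_tensor.device == torch.device("cpu")
            ctx.save_for_backward(cpu_tensor)
            return cpu_tensor * 2

        @staticmethod
        def backward(ctx, grad_output):
            (cpu_tensor,) = ctx.saved_tensors
            # If offloading round-trips the saved tensor through the GPU,
            # it comes back on the wrong device and this assert fires.
            assert cpu_tensor.device == torch.device("cpu")
            return grad_output * 2

    # Large enough that the offloading hooks do not skip it as "too small".
    inp = torch.ones(4096, requires_grad=True)
    with OffloadActivations(use_streams=False):
        out = SomefuncNeedCpuTensors.apply(inp)
    out.sum().backward()
```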

https://github.com/tgale96/grouped_gemm This op is a real-world example: the GMM op takes a parameter named batch_sizes, and that parameter needs to be a CPU tensor.
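A rough sketch of the call pattern in question, assuming the `gg.ops.gmm` entry point from that repo; shapes, sizes, and dtypes are illustrative:

```python
import torch
import grouped_gemm as gg  # https://github.com/tgale96/grouped_gemm

# Three groups of rows in `a`, one weight matrix per group in `b`.
batch_sizes = torch.tensor([2, 4, 2], dtype=torch.int64)  # must stay on CPU
a = torch.randn(8, 16, dtype=torch.bfloat16, device="cuda")
b = torch.randn(3, 16, 32, dtype=torch.bfloat16, device="cuda")

# gmm reads batch_sizes on the host to split `a` into per-group GEMMs, so a
# pass (like activation offloading) that moves it onto CUDA breaks this contract.
out = gg.ops.gmm(a, b, batch_sizes)  # -> shape (8, 32)
```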

> Thanks [@Irvingwangjr](https://github.com/Irvingwangjr), I asked [@janeyx99](https://github.com/janeyx99) if she has availability to take a look.
>
> The way we use it in torchtune is that we only enable it...

> [@Irvingwangjr](https://github.com/Irvingwangjr) Ah, good point. Would it then make more sense to only offload if the tensor is not on CPU?

Yeah, I actually patched the code like this: ...
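The patch body is cut off above; what follows is a hypothetical minimal sketch of the idea using `torch.autograd.graph.saved_tensors_hooks`, not the actual torchtune change (torchtune's offloading adds streams and pinned memory on top of this):

```python
import torch


def pack_hook(t: torch.Tensor):
    # Remember the original device; tensors already on CPU skip the round trip.
    if t.device.type == "cpu":
        return t.device, t
    return t.device, t.to("cpu", non_blocking=True)


def unpack_hook(packed):
    device, t = packed
    # Only tensors that were actually offloaded get moved back.
    return t if device.type == "cpu" else t.to(device, non_blocking=True)


# Usage: activations saved during this forward pass go through the hooks.
x = torch.randn(4, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack_hook, unpack_hook):
    y = (x * x).sum()  # x is saved for backward and passes through pack_hook
y.backward()
```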

> [@Irvingwangjr](https://github.com/Irvingwangjr) If convenient, can you check that this patch [#2466](https://github.com/pytorch/torchtune/pull/2466) does the trick? I am specializing on CUDA here because our streaming logic only works in CUDA.
>
> ...