Yeah, if the workload is submitted via RayJob with an ephemeral Ray Cluster, we can add the custom label at startup. But if we have a long-running Ray Cluster, that...
If we want to implement this, could it be achieved using RaySyncer? The syncer will trigger NodeManager: ConsumeSyncMessage -> UpdateResourceUsage -> UpdateNode -> ClusterResourceManager.AddOrUpdateNode, which eventually updates the NodeResources data structure...
https://kuttl.dev/docs/#pre-requisites maybe we can consider using this tool?
> @Irvingwangjr Kuttl looks cool! I read the README. It seems suitable for testing some sample YAMLs, but it doesn't seem to handle well the case where we need to...
https://github.com/open-feature/open-feature-operator OpenFeature (a CNCF project) adopts this tool, and it also provides some examples.
Many thanks for this! I also want to report a bug here. Reproduction script:
```
def test_offloading_works_with_cpu_tensors() -> None:
    class SomefuncNeedCpuTensors(torch.autograd.Function):
        @staticmethod
        def forward(ctx, cpu_tensor):
            assert cpu_tensor.device == torch.device("cpu")
            ctx.save_for_backward(cpu_tensor)
            return ...
```
https://github.com/tgale96/grouped_gemm This op is a real-world example: the GMM op needs a parameter named batch_sizes, and that parameter must be a CPU tensor.
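For context, here is a rough sketch of how such an op is typically invoked; the exact gmm signature, shapes, and dtypes are my assumptions from skimming the repo, not something verified here. The point is just that batch_sizes stays on CPU while the activations live on GPU, so an offload hook that blindly round-trips every saved tensor through another device breaks the op.
```
# Sketch only: why batch_sizes must remain a CPU tensor for grouped GEMM.
# The exact gmm signature, shapes, and dtypes below are assumptions.
import torch
import grouped_gemm as gg  # https://github.com/tgale96/grouped_gemm

num_experts, k, n, tokens = 4, 64, 128, 256
a = torch.randn(tokens, k, device="cuda", dtype=torch.bfloat16, requires_grad=True)
b = torch.randn(num_experts, k, n, device="cuda", dtype=torch.bfloat16, requires_grad=True)

# batch_sizes is read on the host to build the per-expert GEMM plan, so the op
# expects it on CPU; an activation-offloading hook that saves/restores it on
# GPU would trip exactly the kind of assertion shown in the reproduction above.
batch_sizes = torch.tensor([64, 64, 64, 64], dtype=torch.long)  # CPU tensor

out = gg.ops.gmm(a, b, batch_sizes)
out.sum().backward()
```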
> thanks [@Irvingwangjr](https://github.com/Irvingwangjr), I asked [@janeyx99](https://github.com/janeyx99) if she has availability to take a look.
>
> The way we use it in torchtune is that we only enable it...
> [@Irvingwangjr](https://github.com/Irvingwangjr) Ah, good point. Would it then make more sense to only offload if the tensor is not on CPU?

Yeah, I actually patched the code like this:
```
...
```
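The patch snippet above is truncated, so as a purely hypothetical illustration of that idea (only offload tensors that are not already on CPU), a guard in a saved-tensors pack hook could look roughly like the sketch below; torchtune's actual hook names and bookkeeping differ.
```
# Hypothetical sketch, NOT the actual torchtune patch (which is truncated above).
# It only shows the idea: offload/reload tensors in saved-tensors hooks, but
# leave tensors that already live on CPU untouched.
import torch

def make_offload_hooks():
    def pack(t: torch.Tensor):
        # Tensors already on CPU (e.g. grouped_gemm's batch_sizes) are passed
        # through unchanged instead of being offloaded and reloaded.
        if t.device.type == "cpu":
            return ("keep", t)
        return ("offloaded", t.to("cpu", non_blocking=True), t.device)

    def unpack(packed):
        if packed[0] == "keep":
            return packed[1]
        _, cpu_copy, orig_device = packed
        return cpu_copy.to(orig_device, non_blocking=True)

    return pack, unpack

# Usage: wrap the forward pass so activations saved for backward go through the hooks.
pack, unpack = make_offload_hooks()
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    x = torch.randn(8, 8, requires_grad=True)
    y = (x * x).sum()  # mul saves its inputs, so the pack hook fires
y.backward()
```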
> [@Irvingwangjr](https://github.com/Irvingwangjr) if convenient can you check that this patch [#2466](https://github.com/pytorch/torchtune/pull/2466) does the trick? I am specializing on CUDA here because our streaming logic only works in CUDA.
>
> ...