yizhouv5 issues

Results: 3 issues by yizhouv5

How to trigger a GPU failure so that the GPU count in the node's allocatable field dynamically decreases

### 1. Quick Debug Information

* OS/Version: Ubuntu 22.04
* Container Runtime Type/Version: Docker 20.10
* K8s Flavor/Version: k8s 1.21
* nvidia-device-plugin: v0.14.1
* node-feature-discovery: v0.13.1

### 2. Issue or...

question

needs-triage

Error when enabling flash checkpoint with dlrover and pai-megatron-patch

1. Component versions

* dlrover: 0.4.0
* pai-megatron-patch: v0.10.3

2. Problem description

On two nodes with 8 H20 GPUs each, the llama3.1-70B model is pre-trained with the parallel strategy TP=8, PP=2, DP=1, saving a checkpoint every 30 iterations. The checkpoint (including weights and optimizer) is saved successfully, but the flash checkpoint result reports that it did not succeed.

2.1 Changes to training.py

    #from megatron.training.checkpointing import load_checkpoint, save_checkpoint
    from dlrover.trainer.torch.flash_checkpoint.megatron_dist_ckpt import save_checkpoint
    from dlrover.trainer.torch.flash_checkpoint.megatron_dist_ckpt import load_checkpoint

2.2 Training is launched through the megatron-lm framework. Megatron launch parameters:

    megatron_options=" \ --save ${SAVED_PRETRAIN_CHECKPOINT_PATH}...
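As a quick sanity check on the parallel strategy quoted in this issue, the sketch below recomputes the data-parallel size from the node count, GPUs per node, TP and PP. The helper name `data_parallel_size` is hypothetical, and the sketch assumes no context parallelism or other extra parallel dimensions are in use.

```python
def data_parallel_size(num_nodes: int, gpus_per_node: int, tp: int, pp: int) -> int:
    """Return the data-parallel size implied by the world size, TP and PP.

    Hypothetical helper for illustration; assumes world_size = TP * PP * DP
    with no context parallelism.
    """
    world_size = num_nodes * gpus_per_node
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by TP*PP = {model_parallel}"
        )
    return world_size // model_parallel


# Values taken from the issue: 2 nodes, 8 GPUs each, TP=8, PP=2.
dp = data_parallel_size(num_nodes=2, gpus_per_node=8, tp=8, pp=2)
print(dp)  # 1
```

With 16 GPUs and TP*PP = 16, the result matches the DP=1 stated in the issue.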

[BUG] When Fluid and Alluxio are configured to preheat a dataset, some files are not preheated successfully

**What is your environment (Kubernetes version, Fluid version, etc.)**

* K8s: v1.29.7
* Containerd: 1.7.22
* OS: Ubuntu 22.04.3
* fluid: v1.0.2-41eefb6
* alluxio/alluxio-dev:2.9.0

**Describe the bug**

After the fluid dataset and alluxio runtime CR resources...

bug