yizhouv5

3 issues by yizhouv5

### 1. Quick Debug Information * OS/Version: Ubuntu 22.04 * Container Runtime Type/Version: Docker 20.10 * K8s Flavor/Version: k8s 1.21 * nvidia-device-plugin: v0.14.1 * node-feature-discovery: v0.13.1 ### 2. Issue or...

question
needs-triage

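1. Component versions: dlrover: 0.4.0, pai-megatron-patch: v0.10.3 2. Problem description: Pre-training the llama3.1-70B model on two nodes with 8 H20 GPUs each, with the parallel strategy TP=8, PP=2, DP=1 and a checkpoint saved every 30 iterations. The checkpoint (including weights and optimizer state) is saved successfully, but the flash checkpoint result reports that the save did not succeed. 2.1 Changes to training.py: #from megatron.training.checkpointing import load_checkpoint, save_checkpoint from dlrover.trainer.torch.flash_checkpoint.megatron_dist_ckpt import save_checkpoint from dlrover.trainer.torch.flash_checkpoint.megatron_dist_ckpt import load_checkpoint 2.2 Training is launched through the megatron-lm framework with the following megatron launch arguments: megatron_options=" \ --save ${SAVED_PRETRAIN_CHECKPOINT_PATH}...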

**What is your environment (Kubernetes version, Fluid version, etc.)** K8s: v1.29.7 Containerd: 1.7.22 OS: Ubuntu 22.04.3 fluid: v1.0.2-41eefb6 alluxio/alluxio-dev:2.9.0 **Describe the bug** After the fluid dataset and alluxio runtime CR resources...

bug