[BUG] When preheating a dataset with Fluid and Alluxio, some files fail to be preheated
What is your environment (Kubernetes version, Fluid version, etc.)
- K8s: v1.29.7
- Containerd: 1.7.22
- OS: Ubuntu 22.04.3
- Fluid: v1.0.2-41eefb6
- Alluxio: alluxio/alluxio-dev:2.9.0
Describe the bug: After the Fluid Dataset and AlluxioRuntime CR resources were created, and before the Fluid PVC was mounted into any Kubernetes pod, a DataLoad CR was created to preheat the dataset. Occasionally, some files in the dataset fail to preheat, and the log only reports that the preheat failed. Is there a specific way to locate the cause, and a solution?
What you expect to happen: Each file should be preheated successfully
How to reproduce it

1. Dataset and AlluxioRuntime CR YAML:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: fluid-8b-preheat
spec:
  mounts:
    - mountPoint: local:///yunmai/llama3-ds-ckp
      name: fluid-8b-preheat
  accessModes:
    - ReadWriteMany
---
apiVersion: data.fluid.io/v1alpha1
kind: AlluxioRuntime
metadata:
  name: fluid-8b-preheat
spec:
  replicas: 2  # number of Alluxio cache worker replicas to start
  data:
    replicas: 2
  tieredstore:
    levels:
      - mediumtype: MEM
        path: /dev/shm
        quota: 500Gi
        high: "0.95"
        low: "0.8"
```
2. DataLoad CR YAML:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: fluid-8b-preheat
spec:
  dataset:
    name: fluid-8b-preheat
    namespace: default
  loadMetadata: true
  target:
    - path: /
      replicas: 2
```
3. Loader-job log:
```
- alluxio fs distributedLoad --replication 2 /
Please wait for command submission to finish..
Submitted successfully, jobControlId = 1730198234915
Waiting for the command to finish ...
Get command status information below:
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/.gitattributes
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/LICENSE
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/README.md
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/USE_POLICY.md
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/config.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/configuration.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/generation_config.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/model-00001-of-00004.safetensors
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/model-00002-of-00004.safetensors
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/model-00003-of-00004.safetensors
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/model-00004-of-00004.safetensors
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/model.safetensors.index.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/special_tokens_map.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/tokenizer.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B/tokenizer_config.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/config.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/configuration.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/generation_config.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/latest_checkpointed_iteration.txt
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/model.safetensors.index.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_00_001/model_optim_rng.pt
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_00_003/model_optim_rng.pt
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_01_000/model_optim_rng.pt
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_01_001/model_optim_rng.pt
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_01_002/model_optim_rng.pt
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_01_003/model_optim_rng.pt
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/special_tokens_map.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/tokenizer.json
Successfully loaded path /fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/tokenizer_config.json
Successfully loaded path /fluid-8b-preheat/llama3-datasets/wudao_llama3bpe_content_document.bin
Successfully loaded path /fluid-8b-preheat/llama3-datasets/wudao_llama3bpe_content_document.idx
Total completed file count is 31, failed file count is 2
Finished running the command, jobControlId = 1730198234915
Here are failed files:
/fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_00_002/model_optim_rng.pt,
/fluid-8b-preheat/llama3-ckpts/Meta-Llama-3-8B-tp2-pp4/release/mp_rank_00_000/model_optim_rng.pt,
Check out ./logs/user/distributedLoad__failures.csv for full list of failed files.

real    1m13.203s
user    1m4.701s
sys     0m4.343s

- echo -e 'distributedLoad on / ends'
distributedLoad on / ends
- (( i++ ))
- (( i<1 ))
```
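As a workaround attempt, the two failed files could be re-submitted individually from the failures CSV mentioned in the log. This is only a sketch: it assumes the CSV has the Alluxio path in its first column and is run inside the Alluxio master pod, and `failed_paths` is a hypothetical helper, not a Fluid or Alluxio command:

```shell
# Hypothetical helper: print the first CSV column (the Alluxio path)
# of each row in the distributedLoad failures file.
failed_paths() { cut -d, -f1 "$1"; }

# Retry each failed file individually (CSV path and distributedLoad
# flags taken from the loader-job log above):
# failed_paths ./logs/user/distributedLoad__failures.csv | while read -r p; do
#   alluxio fs distributedLoad --replication 2 "$p"
# done
```

Even with this retry, the question remains why the initial distributedLoad fails for some files only occasionally.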
4. Preheated dataset size:
```
du -sh *
34G   llama3-ckpts
70G   llama3-datasets
```
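For reference, a back-of-the-envelope check that the MEM tier should be able to hold the whole dataset at the requested replication (sizes from the `du -sh` output above, quota and worker count from the AlluxioRuntime CR; this is plain arithmetic, not a Fluid/Alluxio command):

```shell
dataset_gib=$((34 + 70))          # llama3-ckpts + llama3-datasets
needed_gib=$((dataset_gib * 2))   # distributedLoad --replication 2 caches two copies
cache_gib=$((500 * 2))            # 500Gi MEM quota per worker x 2 workers
echo "need ${needed_gib}Gi of ${cache_gib}Gi total cache"
```

So cache capacity (1000Gi for roughly 208Gi of data) does not appear to be the bottleneck here.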
Additional Information