[BUG] Abnormal file system I/O during PT backend training
Bug summary
When I ran DPA2 or DPA3 training using the PT backend, I observed abnormal, continuous heavy reads on the BeeGFS file system hosting the working directory (for a single-GPU training job, it exceeded 2 Gbps). Just a few dozen single-GPU training jobs were enough to saturate the 100 Gbps bandwidth of the storage node, and the training speed dropped significantly.
These read operations did not reach the physical disks, indicating that the range of data blocks being read is very small and the reads are served from the RAM cache.
This phenomenon does not occur on an ordinary NFSoRDMA file system.
Platform
- The Open Source Supercomputing Center of S-A-I;
- LiuLab-HPC
File System
- BeeGFS 8.1.0 (RoCEv2);
- BeeGFS 7.4.5 (IB)
Netdata Monitor of File System I/O on Compute Nodes
DeePMD-kit Version
3.0.0 ~ 3.1.1
Backend and its version
bundled with all offline packages
How did you download the software?
Offline packages
Input Files, Running Commands, Error Log, etc.
Slurm sbatch script:
#!/bin/bash
#SBATCH --job-name=DP-Train
#SBATCH --partition=4V100
#SBATCH --nodes=1
#SBATCH --ntasks=1 # Nodes * GPUs-per-node * Ranks-per-GPU
#SBATCH --gpus-per-node=1 # Specify the GPUs-per-node
#SBATCH --qos=improper-gpu # Depending on your needs [Priority: rush-4gpu = rush-8gpu > improper-gpu > huge-gpu]
export OMP_NUM_THREADS=2
nvidia-smi dmon -s pucvmte -o T > nvdmon_job-$SLURM_JOB_ID.log &
source /opt/envs/deepmd3.1.1.env
export DP_INTERFACE_PREC=low
dp --pt train input.json
Steps to Reproduce
I can provide a supercomputer account for reproducing the problem.
Further Information, Files, and Links
No response
Hi @Entropy-Enthalpy ,
When training a DPA model using the PyTorch backend, the training data for each step are randomly selected and packed into a batch of input. This means there will be many small random reads on the target data files, typically .npy files.
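As a rough illustration (a minimal sketch with made-up shapes and file names, not DeePMD-kit's actual data loader), the access pattern looks something like this: each step picks a handful of random frames from an on-disk .npy file, so only a few kilobytes of each file are actually needed per step.

import numpy as np

# Minimal sketch, NOT DeePMD-kit's real loader: random batching over an
# on-disk .npy file only touches a few rows per training step, so each
# step issues small, non-sequential reads instead of one large scan.
rng = np.random.default_rng(0)

# Hypothetical system: 10,000 frames of 192-atom coordinates in one .npy file.
n_frames, n_atoms = 10_000, 192
np.save("coord.npy", rng.standard_normal((n_frames, n_atoms, 3), dtype=np.float32))

coords = np.load("coord.npy", mmap_mode="r")   # only requested rows are read

batch_size = 8
for _ in range(1000):                          # 1000 "training" steps
    idx = rng.choice(n_frames, size=batch_size, replace=False)
    batch = np.asarray(coords[idx])            # ~8 * 192 * 3 * 4 B ≈ 18 KiB needed per step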
If the file system is configured for large/sequential reads, e.g., with a large read-ahead buffer or by pre-reading whole files, these small random reads can quickly hit the bandwidth limit, which appears to be what is happening in this issue (a rough amplification estimate is sketched below).
I suggest checking the BeeGFS and NFS configurations for anything unfriendly to random reads.
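To get a feel for how much a read-ahead-oriented configuration can amplify this pattern, here is a back-of-the-envelope sketch. The 4 MiB read-ahead window and the file path are assumptions for illustration only; the real BeeGFS/NFS client behavior depends on its tuning.

import os
import numpy as np

# Rough estimate based on assumed numbers: compare the bytes one training
# step actually needs with the bytes the client would transfer if every
# small request pulled in a large read-ahead window.
path = "coord.npy"                         # placeholder: one training .npy file
arr = np.load(path, mmap_mode="r")

batch_size = 8
bytes_needed = batch_size * arr[0].nbytes  # payload actually used per step

readahead = 4 * 1024 * 1024                # assumed 4 MiB read-ahead per request
bytes_pulled = batch_size * min(readahead, os.path.getsize(path))

print(f"needed per step : {bytes_needed / 1024:.1f} KiB")
print(f"pulled per step : {bytes_pulled / 2**20:.1f} MiB (with assumed read-ahead)")
print(f"amplification   : {bytes_pulled / bytes_needed:.0f}x")

If per-step numbers in this range match the throughput Netdata reports during training, the read-ahead/pre-read setting is the likely culprit.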