
[BUG] Abnormal file system I/O during PT backend training

Open · Entropy-Enthalpy opened this issue 3 months ago · 1 comment

Bug summary

When running DPA2 or DPA3 training with the PT backend, I observed abnormal, continuous heavy reads on the BeeGFS file system hosting the working directory (more than 2 Gbps for single-GPU training). Just a few dozen single-GPU training jobs are enough to saturate the 100 Gbps bandwidth of the storage node, and training speed then drops significantly.

These read operations never reached the hardware disks, which indicates that the range of data blocks being read is very small and the reads are served from the RAM cache.

This phenomenon does not occur on an ordinary NFSoRDMA file system.

Platform

  1. The Open Source Supercomputing Center of S-A-I;
  2. LiuLab-HPC

File System

  1. BeeGFS 8.1.0 (RoCEv2);
  2. BeeGFS 7.4.5 (IB)

Netdata Monitor of File System I/O on Compute Nodes (screenshot)

DeePMD-kit Version

3.0.0 ~ 3.1.1

Backend and its version

bundled with all offline packages

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

Slurm sbatch script:

#!/bin/bash
#SBATCH --job-name=DP-Train
#SBATCH --partition=4V100
#SBATCH --nodes=1
#SBATCH --ntasks=1          # Nodes * GPUs-per-node * Ranks-per-GPU
#SBATCH --gpus-per-node=1   # Specify the GPUs-per-node
#SBATCH --qos=improper-gpu  # Depending on your needs [Priority: rush-4gpu = rush-8gpu > improper-gpu > huge-gpu]

export OMP_NUM_THREADS=2

nvidia-smi dmon -s pucvmte -o T > nvdmon_job-$SLURM_JOB_ID.log &

source /opt/envs/deepmd3.1.1.env

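# Optional: run the data/model interface in low (single, float32) precision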
export DP_INTERFACE_PREC=low

dp --pt train input.json

Steps to Reproduce

I can provide a supercomputer account for reproducing the problem.
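To quantify the reads independently of Netdata, the per-process counters in /proc/<pid>/io can be sampled while the training job is running. Below is a minimal sketch (the PID and the sampling interval are placeholders); rchar counts bytes requested by read()-style syscalls, including cache hits, while read_bytes counts bytes actually fetched from the storage layer:

#!/usr/bin/env python3
# Minimal sketch: sample /proc/<pid>/io to estimate the read volume of a
# training process. PID is passed on the command line; interval is arbitrary.
import sys
import time

def read_io(pid):
    stats = {}
    with open(f"/proc/{pid}/io") as f:
        for line in f:
            key, value = line.split(":")
            stats[key.strip()] = int(value)
    return stats

def main(pid, interval=5.0):
    prev = read_io(pid)
    while True:
        time.sleep(interval)
        cur = read_io(pid)
        # Convert byte deltas to Gbit/s for comparison with the network monitor.
        rchar_gbps = (cur["rchar"] - prev["rchar"]) * 8 / interval / 1e9
        rbytes_gbps = (cur["read_bytes"] - prev["read_bytes"]) * 8 / interval / 1e9
        print(f"rchar: {rchar_gbps:.2f} Gbit/s, read_bytes: {rbytes_gbps:.2f} Gbit/s")
        prev = cur

if __name__ == "__main__":
    main(int(sys.argv[1]))

Run it on the compute node with the PID of the dp training process as the only argument.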

Further Information, Files, and Links

No response

Entropy-Enthalpy avatar Oct 11 '25 18:10 Entropy-Enthalpy

Hi @Entropy-Enthalpy, when training a DPA model with the PyTorch backend, the training data for each step are randomly selected and packed into a batch of input. This means there will be many small random reads on the target data files, typically .npy files. If the file system is configured for large/sequential reads, e.g. a large read-ahead buffer or pre-reading the whole file, it will quickly hit the bandwidth limit, which is what is happening in this issue. I suggest checking the configuration of BeeGFS and the NFS protocol for anything unfriendly to random reads.
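For illustration, here is a rough sketch of that access pattern (this is not DeePMD-kit's actual data loader; the path, shapes, and batch size are placeholders). Each step gathers a random subset of frames, so only a few KiB per frame are actually needed, but a client tuned for sequential I/O may read far ahead of each offset, or even the whole file, on every access:

# Rough sketch of the per-step access pattern (NOT the actual DeePMD-kit
# loader): random frames are gathered from a memory-mapped .npy file, so
# each step issues a handful of small reads at scattered offsets. An
# aggressive read-ahead turns each of these small reads into a much larger
# transfer from the storage servers.
import numpy as np

rng = np.random.default_rng()

# Placeholder path; the array is assumed to have shape (nframes, natoms * 3).
coords = np.load("set.000/coord.npy", mmap_mode="r")  # no full read here
n_frames, batch_size = coords.shape[0], 32

for step in range(1000):
    idx = rng.choice(n_frames, size=batch_size, replace=False)
    batch = coords[idx]  # touches roughly batch_size pages at random offsets
    # ... feed `batch` to the model ...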

caic99 avatar Oct 12 '25 05:10 caic99