
[BUG] Training fails if the number of parquet files is less than the number of GPUs


Describe the bug
I am running the Scaling Criteo sample with minor changes: https://github.com/NVIDIA-Merlin/NVTabular/blob/main/examples/scaling-criteo/03-Training-with-HugeCTR.ipynb

I have 24 parquet files for training and 1 parquet file for validation. If I run the training with 4 x GPUs (A100), the job fails with the following message:

Check Failed!
File: /hugectr/HugeCTR/include/data_readers/file_source_parquet.hpp:99
Function: ParquetFileSource
Expression: file_list_.get_num_of_files() >= stride_
Hint: The number of data reader workers should be no greater than the number of files in the file list. There is one worker on each GPU for Parquet dataset, please re-configure vvgpu within CreateSolver or guarantee enough files in the file list.

Full error log: https://gist.github.com/leiterenato/3efd6735b0ce792a3f127a0b14ec6d0d

To Reproduce
Steps to reproduce the behavior:

  1. Download the model.py (https://gist.github.com/leiterenato/c79aa3a97bd98953a4436134670626b5) and task.py (https://gist.github.com/leiterenato/96380c8e8f7106031f5db4fb16d7898c)
  2. Execute on an instance with 4 GPUs (A100). Make sure there is only 1 file for either validation or training (Criteo dataset).
python task.py \
    --per_gpu_batch_size=2048 \
    --model_name=deepfm \
    --train_data=/workspaces/dataset/new_transformed/_train_list.txt \
    --valid_data=/workspaces/dataset/new_transformed/_valid_list.txt \
    --schema=/workspaces/dataset/new_transformed/schema.pbtxt \
    --slot_size_array="10000000 10000000 3014529 400781 11 2209 11869 148 4 977 15 38713 10000000 10000000 10000000 584616 12883 109 37 17177 7425 20266 4 7085 1535 64" \
    --max_iter=1000 \
    --max_eval_batches=500 \
    --eval_batches=2500 \
    --dropout_rate=0.5 \
    --lr=0.001 \
    --num_workers=12 \
    --num_epochs=0 \
    --eval_interval=1000 \
    --snapshot=0 \
    --display_interval=200 \
    --gpus="[[0,1,2,3]]"

Expected behavior
Training completes successfully.

Environment (please complete the following information):

  • OS: Linux 5e93a2668dff 4.19.0-18-cloud-amd64 #1 SMP Debian 4.19.208-1 (2021-09-29) x86_64 x86_64 x86_64 GNU/Linux
  • Graphic card: 4 x A100
  • CUDA version: 11.0
  • Docker image: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/merlin/containers/merlin-training

leiterenato avatar Mar 09 '22 02:03 leiterenato

Dong Meng: This is a critical bug. The problem is that when we train a model we split the data into training data and validation data. In this case, for the Criteo dataset, we use the first 12 days as training data and the last day as validation data, so we end up with 12 training files but only one validation file. During training the training part is fine, but HugeCTR fails when it tries to validate after each evaluation interval.

viswa-nvidia avatar Mar 14 '22 21:03 viswa-nvidia

@zehuanw , please take a look at this issue

viswa-nvidia avatar Mar 14 '22 21:03 viswa-nvidia

Hi @leiterenato @viswa-nvidia, thank you for your feedback! This is a known issue, and the error message is reported as expected:

Hint: The number of data reader workers should be no greater than the number of files in the file list. There is one worker on each GPU for Parquet dataset, please re-configure vvgpu within CreateSolver or guarantee enough files in the file list.
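
For reference, "re-configure vvgpu within CreateSolver" means shrinking the GPU list passed to the solver so that the number of Parquet reader workers (one per GPU) does not exceed the number of files. A minimal sketch, assuming the HugeCTR Python API used in the Scaling Criteo notebook; the parameter values are illustrative, not taken from the failing run:

import hugectr

# With only one validation file, run on a single GPU so that the single
# Parquet reader worker never outnumbers the files in the file list.
solver = hugectr.CreateSolver(
    vvgpu=[[0]],            # was [[0, 1, 2, 3]] in the failing run
    batchsize=2048,
    batchsize_eval=2048,
    max_eval_batches=500,
    lr=0.001,
    repeat_dataset=True,
)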

Before the fix is available, could you please try splitting the validation dataset at a finer granularity? If you have four GPUs, you can split it into four or more files.
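
One possible way to do the split, as a sketch using dask.dataframe; the paths are placeholders for the dataset referenced above, and the HugeCTR file list has to be regenerated afterwards:

import dask.dataframe as dd

# Read the single validation Parquet file and rewrite it as 4 part files,
# so that #files >= #GPUs for the Parquet data reader.
valid = dd.read_parquet("/workspaces/dataset/new_transformed/valid.parquet")
valid = valid.repartition(npartitions=4)
valid.to_parquet("/workspaces/dataset/new_transformed/valid_split")
# _valid_list.txt must then be updated to list the new part files.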

zehuanw avatar Mar 15 '22 06:03 zehuanw

I believe @leiterenato worked around the bug by using a single GPU for training and validation. @zehuanw, when is the fix planned?

sohn21c avatar May 06 '22 22:05 sohn21c

Hi @sohn21c, it involves a change to the underlying data reader design, and we are working on it. The modification will not be trivial, but we will keep you posted.

zehuanw avatar May 09 '22 09:05 zehuanw

@zehuanw, can you please share an update if the change has been released? If not, when is it planned? Let's post the plan and close the issue.

jsohn-nvidia avatar Nov 01 '22 16:11 jsohn-nvidia

Sorry for the late response. The #files < #gpus problem was actually fixed in 22.10, but the fix slipped out of the release notes: https://github.com/NVIDIA-Merlin/HugeCTR/blob/main/HugeCTR/src/pybind/add_input.cpp#L277-L310. The data reader now chooses the worker count as follows: if #files > #gpus, then #workers = #gpus; otherwise #workers = #files. Therefore, if the provided #files is less than #gpus, training runs but you'll lose some performance. @leiterenato FYI @zehuanw @jsohn-nvidia @viswa-nvidia
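
In other words, the worker count is now chosen roughly like this; a small Python paraphrase of the selection rule described above, not the actual C++ implementation in add_input.cpp:

def num_parquet_workers(num_files: int, num_gpus: int) -> int:
    # Since 22.10: instead of failing the #files >= #workers check,
    # fall back to one worker per file when there are fewer files than GPUs.
    return num_gpus if num_files > num_gpus else num_files

print(num_parquet_workers(num_files=1, num_gpus=4))   # 1 worker -> runs, but slower reads
print(num_parquet_workers(num_files=24, num_gpus=4))  # 4 workers -> one per GPU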

JacoCheung avatar Nov 08 '22 11:11 JacoCheung