HugeCTR
[BUG] hps_demo.ipynb crashes with core dump
Describe the bug When running the hps_demo.ipynb notebook with the latest docker container, 22.05, the third code cell, where we train the model, crashes with a core dump. I have tested on A100 DGX and V100 PCIe systems. The code runs fine when using the 22.04 container. The output from the V100 is below:
HugeCTR Version: 3.6
====================================================Model Init=====================================================
[HCTR][00:27:59.659][INFO][RK0][main]: Initialize model: hps_demo
[HCTR][00:27:59.659][WARNING][RK0][main]: MPI was already initialized somewhere elese. Lifetime service disabled.
[HCTR][00:27:59.659][INFO][RK0][main]: Global seed is 2566660528
[HCTR][00:27:59.732][INFO][RK0][main]: Device to NUMA mapping:
GPU 0 -> node 0
[HCTR][00:28:00.700][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][00:28:00.700][INFO][RK0][main]: Start all2all warmup
[HCTR][00:28:00.700][INFO][RK0][main]: End all2all warmup
[HCTR][00:28:00.701][INFO][RK0][main]: Using All-reduce algorithm: NCCL
set_mempolicy: Operation not permitted
[HCTR][00:28:00.701][INFO][RK0][main]: Device 0: Tesla V100S-PCIE-32GB
[HCTR][00:28:00.701][INFO][RK0][main]: num of DataReader workers: 1
set_mempolicy: Operation not permitted
set_mempolicy: Operation not permitted
set_mempolicy: Operation not permitted
set_mempolicy: Operation not permitted
set_mempolicy: Operation not permitted
[HCTR][00:28:00.702][INFO][RK0][main]: Vocabulary size: 40000
set_mempolicy: Operation not permitted
[HCTR][00:28:00.702][INFO][RK0][main]: max_vocabulary_size_per_gpu_=21845
terminate called without an active exception
terminate called without an active exception
[hsw217:00831] *** Process received signal ***
[hsw217:00831] Signal: Aborted (6)
[hsw217:00831] Signal code: (-6)
To Reproduce
Steps to reproduce the behavior:
1. Start the Merlin docker container:
docker run -it --network=host --cap-add SYS_NICE --gpus all -u $(id -u):$(id -g) --ipc=host -v /path/to/hugectr:/workspace nvcr.io/nvidia/merlin/merlin-training:22.05 bash
Note: I also run export JUPYTER_DATA_DIR="/workspace"
to work around the current home directory issue.
2. Start the Jupyter notebook server:
cd /hugectr/notebooks
jupyter-notebook --allow-root --ip 0.0.0.0 --port 8888 --NotebookApp.token='hugectr'
3. Run the first three cells in the hps_demo.ipynb notebook to hit the error.
Expected behavior The cell is expected to run without crashing.
Environment (please complete the following information):
- OS: CentOS Linux 7
- Graphics card: NVIDIA DGX A100
- Docker image: nvcr.io/nvidia/merlin/merlin-training:latest
Additional context
Hi, @PeterDykas. Thanks for your feedback. I think this issue boils down to the data_generator: there is an inconsistency between _metadata.json and file_list.txt regarding the file names. Both files are generated by the data_generator.
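For context, the notebook produces these files with HugeCTR's built-in data generator. A rough sketch of that step is below; the paths, dimensions, and slot sizes are illustrative (chosen to mirror the _metadata.json shown further down) and may differ from the exact values in hps_demo.ipynb:

import hugectr
from hugectr.tools import DataGeneratorParams, DataGenerator

# Sketch of the data generation step. label_dim=1, dense_dim=10 and num_slot=4
# mirror the label0 / C1-C10 / C11-C14 columns in the _metadata.json below;
# the slot sizes and paths are assumptions for illustration only.
data_generator_params = DataGeneratorParams(
    format=hugectr.DataReaderType_t.Parquet,
    label_dim=1,
    dense_dim=10,
    num_slot=4,
    i64_input_key=True,
    source="./data_parquet/file_list.txt",
    eval_source="./data_parquet/file_list_test.txt",
    slot_size_array=[10000, 10000, 10000, 10000],
    check_type=hugectr.Check_t.Non,
    dist_type=hugectr.Distribution_t.PowerLaw,
    power_law_type=hugectr.PowerLaw_t.Short,
)
DataGenerator(data_generator_params).generate()
# generate() writes the gen_*.parquet files plus _metadata.json and file_list.txt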
The _metadata.json resides in data_parquet/train/_metadata.json
and it records the num_rows of each file.
The file_list looks like this:
./data_parquet/train/gen_0.parquet
./data_parquet/train/gen_1.parquet
./data_parquet/train/gen_2.parquet
./data_parquet/train/gen_3.parquet
./data_parquet/train/gen_4.parquet
./data_parquet/train/gen_5.parquet
./data_parquet/train/gen_6.parquet
./data_parquet/train/gen_7.parquet
./data_parquet/train/gen_8.parquet
./data_parquet/train/gen_9.parquet
./data_parquet/train/gen_10.parquet
./data_parquet/train/gen_11.parquet
./data_parquet/train/gen_12.parquet
./data_parquet/train/gen_13.parquet
./data_parquet/train/gen_14.parquet
./data_parquet/train/gen_15.parquet
While the _metadata.json is
{ "file_stats": [{"file_name": "0.parquet", "num_rows":40960}, {"file_name": "1.parquet", "num_rows":40960}, {"file_name": "2.parquet", "num_rows":40960}, {"file_name": "3.parquet", "num_rows":40960}, {"file_name": "4.parquet", "num_rows":40960}, {"file_name": "5.parquet", "num_rows":40960}, {"file_name": "6.parquet", "num_rows":40960}, {"file_name": "7.parquet", "num_rows":40960}, {"file_name": "8.parquet", "num_rows":40960}, {"file_name": "9.parquet", "num_rows":40960}, {"file_name": "10.parquet", "num_rows":40960}, {"file_name": "11.parquet", "num_rows":40960}, {"file_name": "12.parquet", "num_rows":40960}, {"file_name": "13.parquet", "num_rows":40960}, {"file_name": "14.parquet", "num_rows":40960}, {"file_name": "15.parquet", "num_rows":40960} ], "labels": [{"col_name": "label0", "index":0} ], "conts": [{"col_name": "C1", "index":1}, {"col_name": "C2", "index":2}, {"col_name": "C3", "index":3}, {"col_name": "C4", "index":4}, {"col_name": "C5", "index":5}, {"col_name": "C6", "index":6}, {"col_name": "C7", "index":7}, {"col_name": "C8", "index":8}, {"col_name": "C9", "index":9}, {"col_name": "C10", "index":10} ], "cats": [{"col_name": "C11", "index":11}, {"col_name": "C12", "index":12}, {"col_name": "C13", "index":13}, {"col_name": "C14", "index":14} ] }
The file_name entries in _metadata.json do not match the file names in file_list.txt (the actual files have a gen_
prefix while the entries in _metadata.json do not). This is a minor bug and I'll fix it soon.
As a workaround, you can manually either (a) rename the parquet files and modify the content of file_list.txt accordingly, or (b) modify the content of _metadata.json. Note that the file_name in _metadata.json is always the basename (the preceding directory path is discarded).
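Here is a minimal sketch of option (b), assuming the notebook's default ./data_parquet layout: it rewrites every _metadata.json so the file_name entries carry the same gen_ prefix as the actual parquet files.

import glob
import json

# Patch every _metadata.json under ./data_parquet (path assumed from the
# notebook) so its file_name entries match the gen_*.parquet files on disk.
for meta_path in glob.glob("./data_parquet/*/_metadata.json"):
    with open(meta_path) as f:
        meta = json.load(f)

    for stat in meta["file_stats"]:
        if not stat["file_name"].startswith("gen_"):
            # e.g. "0.parquet" -> "gen_0.parquet"
            stat["file_name"] = "gen_" + stat["file_name"]

    with open(meta_path, "w") as f:
        json.dump(meta, f)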
Hi @PeterDykas, could you check if @JacoCheung's reply helps? Thanks!
The Parquet data generator issue has been fixed in the 22.07 release.