[core dataset compilation error]
Describe the bug
When using the most recent Megatron-LM fork, I get the following error:
make: Entering directory '/workspace/megatron-lm/megatron/core/datasets'
g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/usr/include/python3.10 -I/usr/local/lib/python3.10/dist-packages/pybind11/include helpers.cpp -o helpers.cpython-310-x86_64-linux-gnu.so
make: Leaving directory '/workspace/megatron-lm/megatron/core/datasets'
ERROR:megatron.core.datasets.utils:Failed to compile the C++ dataset helper functions
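Note that the make output above does not show the underlying compiler error. Rebuilding by hand inside the container surfaces it; a minimal sketch, assuming the same mount points as the reproduction script below:

cd /workspace/megatron-lm/megatron/core/datasets
rm -f helpers.cpython-310-x86_64-linux-gnu.so   # remove the stale artifact to force a rebuild
make 2>&1 | tee /tmp/helpers_build.log          # /tmp path is arbitrary; keeps the full compiler output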
To Reproduce
#!/bin/bash
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --gpus-per-node=8
#SBATCH --partition=batch # Adjust this for your cluster
#SBATCH --output=/home/shamane/logs/training_scratch/log.out # Adjust this for your cluster
#SBATCH --error=/home/shamane/logs/training_scratch/error.err # Adjust this for your cluster
export MASTER_ADDR=$(hostname)
export GPUS_PER_NODE=8
# ---
export LD_LIBRARY_PATH=/usr/lib:/usr/lib64
export NCCL_TESTS_HOME=nccl-tests
export NCCL_DEBUG=INFO
export NCCL_ALGO=RING
export NCCL_IB_AR_THRESHOLD=0
export NCCL_IB_PCI_RELAXED_ORDERING=1
export NCCL_IB_SPLIT_DATA_ON_QPS=0
export NCCL_IB_QPS_PER_CONNECTION=2
export UCX_IB_PCI_RELAXED_ORDERING=on
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_SOCKET_IFNAME=enp27s0np0
export NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_7:1,mlx5_8:1,mlx5_9:1
export NCCL_IGNORE_CPU_AFFINITY=1
# ---
nodes_array=($(scontrol show hostnames $SLURM_JOB_NODELIST))
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
echo "Node IP: $head_node_ip"
# Specify the Docker image to use.
PYTORCH_IMAGE="nvcr.io/nvidia/pytorch:24.03-py3"
# Define the path to the Megatron-LM directory on the head node.
MEGATRONE_PATH="/home/shamane/Megatron-LM-luke" # Update with actual path. Path should be on the head node.
# Set paths for checkpoints and tokenizer data. These should be on a shared data directory.
SHARED_DIR="/data/fin_mixtral_2B/"
#MASTER_ADDR=${MASTER_ADDR:-"localhost"}
MASTER_ADDR=$head_node_ip
MASTER_PORT=${MASTER_PORT:-"6008"}
NNODES=${SLURM_NNODES:-"1"}
NODE_RANK=${RANK:-"0"}
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
echo "SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST"
echo "SLURM_NNODES: $SLURM_NNODES"
echo "SLURM_NODEID: $SLURM_NODEID"
echo "MASTER_ADDR: $MASTER_ADDR"
echo "NNODES: $NNODES"
echo "MASTER_PORT: $MASTER_PORT"
echo "NODE_RANK: $NODE_RANK"
#module load docker
echo "-v $SHARED_DIR:/workspace/data"
echo "-v $MEGATRONE_PATH:/workspace/megatron-lm"
echo "$PYTORCH_IMAGE"
echo "bash -c \"pip install flash-attn sentencepiece && \
bash /workspace/megatron-lm/examples/mixtral/run_mixtral_distributed.sh \
/workspace/data/megatrone_checkpoints \
/workspace/data/tokenizers/tokenizer.model \
/workspace/data/processed_data/finance_2b_mixtral_text_document \
$MASTER_ADDR \
$MASTER_PORT \
$NNODES \
$NODE_RANK\""
# Run the Docker container with the specified PyTorch image.
srun docker run \
-e SLURM_JOB_ID=$SLURM_JOB_ID \
--gpus all \
--ipc=host \
--network=host \
--workdir /workspace/megatron-lm \
-v $SHARED_DIR:/workspace/data \
-v $MEGATRONE_PATH:/workspace/megatron-lm \
$PYTORCH_IMAGE \
bash -c "pip install flash-attn sentencepiece wandb 'git+https://github.com/fanshiqing/[email protected]' && \
bash /workspace/megatron-lm/examples/mixtral/run_mixtral_distributed.sh \
/workspace/data/mixtral8x7-instr-tp2-emp8-ggemm \
/workspace/data/tokenizers/tokenizer.model \
/workspace/data/processed_data/finance_2b_mixtral_text_document \
$MASTER_ADDR \
$MASTER_PORT \
$NNODES \
$NODE_RANK"
# This Docker command mounts the specified Megatron-LM and data directories, sets the working directory,
# and runs the 'run_mixtral_distributed.sh' script inside the container.
# This script facilitates distributed training using the specified PyTorch image, leveraging NVIDIA's optimizations.
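Before submitting the job, it can help to verify that the compiler and the Python/pybind11 headers the dataset helpers need are actually present in the image. A quick one-off check, assuming the same image tag (no GPUs required):

docker run --rm nvcr.io/nvidia/pytorch:24.03-py3 bash -c '
  g++ --version | head -n1
  python3 -c "import sysconfig; print(sysconfig.get_paths()[\"include\"])"
  python3 -c "import pybind11; print(pybind11.get_include())"
'

If either import fails, or the printed include directories differ from the ones in the g++ command shown in the error log, the runtime compilation in megatron/core/datasets will fail the same way.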
Environment (please complete the following information):
PYTORCH_IMAGE="nvcr.io/nvidia/pytorch:24.03-py3"
Proposed fix
If you have a proposal for how to fix the issue, state it here or link to a PR.
Additional context
This works well with the fork that I downloaded 4 days ago.
Marking as stale. No activity in 60 days.
Workaround: compile manually in /megatron/core/datasets and then comment out the compile_helpers function in /megatron/core/datasets/utils.py.
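A sketch of that workaround, assuming the container layout from the reproduction script above (the exact lines to comment out depend on your checkout):

cd /workspace/megatron-lm/megatron/core/datasets
make   # builds helpers.cpython-*.so using the Python and pybind11 headers in the image
# Then edit megatron/core/datasets/utils.py by hand and comment out the body of
# compile_helpers, so the runtime build is skipped and the prebuilt .so is used.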