[core dataset compilation error]
Describe the bug
When using the most recent Megatron-LM fork, I get the following error:
make: Entering directory '/workspace/megatron-lm/megatron/core/datasets'
g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/usr/include/python3.10 -I/usr/local/lib/python3.10/dist-packages/pybind11/include helpers.cpp -o helpers.cpython-310-x86_64-linux-gnu.so
make: Leaving directory '/workspace/megatron-lm/megatron/core/datasets'
ERROR:megatron.core.datasets.utils:Failed to compile the C++ dataset helper functions
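Note that the make output above does not show the underlying compiler error. Rebuilding by hand inside the container surfaces it; a minimal sketch, assuming the same mount points as the reproduction script below:

cd /workspace/megatron-lm/megatron/core/datasets
rm -f helpers.cpython-310-x86_64-linux-gnu.so   # remove the stale artifact to force a rebuild
make 2>&1 | tee /tmp/helpers_build.log          # /tmp path is arbitrary; keeps the full compiler output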
To Reproduce
#!/bin/bash
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH --gpus-per-node=8
#SBATCH --partition=batch # Adjust this for your cluster
#SBATCH --output=/home/shamane/logs/training_scratch/log.out # Adjust this for your cluster
#SBATCH --error=/home/shamane/logs/training_scratch/error.err # Adjust this for your cluster
export MASTER_ADDR=$(hostname)
export GPUS_PER_NODE=8
# ---
export LD_LIBRARY_PATH=/usr/lib:/usr/lib64
export NCCL_TESTS_HOME=nccl-tests
export NCCL_DEBUG=INFO
export NCCL_ALGO=RING
export NCCL_IB_AR_THRESHOLD=0
export NCCL_IB_PCI_RELAXED_ORDERING=1
export NCCL_IB_SPLIT_DATA_ON_QPS=0
export NCCL_IB_QPS_PER_CONNECTION=2
export UCX_IB_PCI_RELAXED_ORDERING=on
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export NCCL_SOCKET_IFNAME=enp27s0np0
export NCCL_IB_HCA=mlx5_0:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_7:1,mlx5_8:1,mlx5_9:1
export NCCL_IGNORE_CPU_AFFINITY=1
# ---
nodes_array=($(scontrol show hostnames $SLURM_JOB_NODELIST))
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
echo "Node IP: $head_node_ip"
# Specify the Docker image to use.
PYTORCH_IMAGE="nvcr.io/nvidia/pytorch:24.03-py3"
# Define the path to the Megatron-LM directory on the head node.
MEGATRONE_PATH="/home/shamane/Megatron-LM-luke" # Update with actual path. Path should be on the head node.
# Set paths for checkpoints and tokenizer data. These should be on a shared data directory.
SHARED_DIR="/data/fin_mixtral_2B/"
#MASTER_ADDR=${MASTER_ADDR:-"localhost"}
MASTER_ADDR=$head_node_ip
MASTER_PORT=${MASTER_PORT:-"6008"}
NNODES=${SLURM_NNODES:-"1"}
NODE_RANK=${RANK:-"0"}
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
echo "SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST"
echo "SLURM_NNODES: $SLURM_NNODES"
echo "SLURM_NODEID: $SLURM_NODEID"
echo "MASTER_ADDR: $MASTER_ADDR"
echo "NNODES: $NNODES"
echo "MASTER_PORT: $MASTER_PORT"
echo "NODE_RANK: $NODE_RANK"
#module load docker
echo "-v $SHARED_DIR:/workspace/data"
echo "-v $MEGATRONE_PATH:/workspace/megatron-lm"
echo "$PYTORCH_IMAGE"
echo "bash -c \"pip install flash-attn sentencepiece && \
bash /workspace/megatron-lm/examples/mixtral/run_mixtral_distributed.sh \
/workspace/data/megatrone_checkpoints \
/workspace/data/tokenizers/tokenizer.model \
/workspace/data/processed_data/finance_2b_mixtral_text_document \
$MASTER_ADDR \
$MASTER_PORT \
$NNODES \
$NODE_RANK\""
# Run the Docker container with the specified PyTorch image.
srun docker run \
-e SLURM_JOB_ID=$SLURM_JOB_ID \
--gpus all \
--ipc=host \
--network=host \
--workdir /workspace/megatron-lm \
-v $SHARED_DIR:/workspace/data \
-v $MEGATRONE_PATH:/workspace/megatron-lm \
$PYTORCH_IMAGE \
bash -c "pip install flash-attn sentencepiece wandb 'git+https://github.com/fanshiqing/[email protected]' && \
bash /workspace/megatron-lm/examples/mixtral/run_mixtral_distributed.sh \
/workspace/data/mixtral8x7-instr-tp2-emp8-ggemm \
/workspace/data/tokenizers/tokenizer.model \
/workspace/data/processed_data/finance_2b_mixtral_text_document \
$MASTER_ADDR \
$MASTER_PORT \
$NNODES \
$NODE_RANK"
# This Docker command mounts the specified Megatron-LM and data directories, sets the working directory,
# and runs the 'run_mixtral_distributed.sh' script inside the container.
# This script facilitates distributed training using the specified PyTorch image, leveraging NVIDIA's optimizations.
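Before submitting the job, it can help to verify that the compiler and the Python/pybind11 headers the dataset helpers need are actually present in the image. A quick one-off check, assuming the same image tag (no GPUs required):

docker run --rm nvcr.io/nvidia/pytorch:24.03-py3 bash -c '
  g++ --version | head -n1
  python3 -c "import sysconfig; print(sysconfig.get_paths()[\"include\"])"
  python3 -c "import pybind11; print(pybind11.get_include())"
'

If either import fails, or the printed include directories differ from the ones in the g++ command shown in the error log, the runtime compilation in megatron/core/datasets will fail the same way.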
Environment (please complete the following information):
PYTORCH_IMAGE="nvcr.io/nvidia/pytorch:24.03-py3"
Proposed fix
If you have a proposal for how to fix the issue, state it here or link to a PR.
Additional context
This works well with the fork that I downloaded 4 days ago.
Marking as stale. No activity in 60 days.
Workaround: compile manually in /megatron/core/datasets and then comment out the compile_helpers function in /megatron/core/datasets/utils.py.
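A sketch of that workaround, assuming the container layout from the reproduction script above (the exact lines to comment out depend on your checkout):

cd /workspace/megatron-lm/megatron/core/datasets
make   # builds helpers.cpython-*.so using the Python and pybind11 headers in the image
# Then edit megatron/core/datasets/utils.py by hand and comment out the body of
# compile_helpers, so the runtime build is skipped and the prebuilt .so is used.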