[Llama3 Model Distillation] IndexError: pop from empty list

Open mZhenz opened this issue 7 months ago • 3 comments

Describe the bug

I am following the tutorial at https://github.com/NVIDIA/NeMo/tree/main/tutorials/llm/llama/pruning-distillation to prune and distill Llama 3.1 8B.

When running 04_distillation, I get the following error.

-distilled/0 [default4]:-distilled/0 [default2]:[rank2]: Traceback (most recent call last):
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     return _run_code(code, main_globals, None,
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     exec(code, run_globals)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/home/jovyan/workspace-0/xxx-proj/NeMo-Run-2505/nemo_run/core/runners/fdl_runner.py", line 72, in <module>
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     fdl_runner_app()
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 326, in __call__
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     raise e
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 309, in __call__
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     return get_command(self)(*args, **kwargs)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1442, in __call__
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     return self.main(*args, **kwargs)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 661, in main
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     return _main(
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 193, in _main
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     rv = self.invoke(ctx)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1226, in invoke
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     return ctx.invoke(self.callback, **ctx.params)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 794, in invoke
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     return callback(*args, **kwargs)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 692, in wrapper
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     return callback(**use_params)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/home/jovyan/workspace-0/xxx-proj/NeMo-Run-2505/nemo_run/core/runners/fdl_runner.py", line 68, in fdl_direct_run
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     fdl_fn()
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/collections/llm/api.py", line 434, in distill
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     return train(
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/collections/llm/api.py", line 127, in train
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     trainer.fit(model, data)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     call._call_and_handle_interrupt(
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     return function(*args, **kwargs)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     self._run(model, ckpt_path=ckpt_path)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     results = self._run_stage()
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     self.fit_loop.run()
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     self.advance()
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     self.epoch_loop.run(self._data_fetcher)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     self.advance(data_fetcher)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/lightning/pytorch/trainer.py", line 47, in advance
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     super().advance(data_fetcher)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 250, in advance
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 190, in run
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     self._optimizer_step(batch_idx, closure)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 268, in _optimizer_step
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     call._call_lightning_module_hook(
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 167, in _call_lightning_module_hook
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     output = fn(*args, **kwargs)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/core/module.py", line 1306, in optimizer_step
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     optimizer.step(closure=optimizer_closure)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/core/optimizer.py", line 153, in step
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/lightning/pytorch/strategies/megatron_strategy.py", line 779, in optimizer_step
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/ddp.py", line 270, in optimizer_step
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/strategy.py", line 238, in optimizer_step
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/plugins/precision/precision.py", line 122, in optimizer_step
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     return optimizer.step(closure=closure, **kwargs)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 124, in wrapper
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/core/optim/mcore_optim.py", line 129, in step
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     loss = closure()
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/plugins/precision/precision.py", line 108, in _wrap_closure
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     closure_result = closure()
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 144, in __call__
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     self._result = self.closure(*args, **kwargs)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     return func(*args, **kwargs)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 129, in closure
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     step_output = self._step_fn()
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 317, in _training_step
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 319, in _call_strategy_hook
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     output = fn(*args, **kwargs)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/lightning/pytorch/strategies/megatron_strategy.py", line 713, in training_step
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     out = self.model.training_step(dataloader_iter, *args, **kwargs)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/lightning/megatron_parallel.py", line 389, in training_step
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     return self._step(
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/lightning/megatron_parallel.py", line 501, in _step
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     return self.forward(
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/lightning/megatron_parallel.py", line 351, in forward
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     microbatch_outputs = step()
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/lightning/megatron_parallel.py", line 1287, in __call__
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     return self.forward_backward_func(
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/home/jovyan/workspace-0/xxx-proj/Megatron-LM-2505/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     output_tensor, num_tokens = forward_step(
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/home/jovyan/workspace-0/xxx-proj/Megatron-LM-2505/megatron/core/pipeline_parallel/schedules.py", line 303, in forward_step
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     outputs = loss_func(output_tensor)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     return self._call_impl(*args, **kwargs)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1857, in _call_impl
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     return inner()
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1805, in inner
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     result = forward_call(*args, **kwargs)
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/collections/llm/modelopt/distill/model.py", line 97, in forward
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     loss_for_ub = self._distillation_loss_fn(
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:   File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/distill/distillation_model.py", line 271, in compute_kd_loss
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]:     out_t = teacher_layer._intermediate_output.pop(0)  # can store multiple in special cases
-distilled/0 [default4]:-distilled/0 [default2]:[rank2]: IndexError: pop from empty list
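
For readers unfamiliar with the pattern in the last frame, here is a toy, pure-Python illustration of how a `pop(0)` on a captured-output list can fail. It is only a sketch: `CapturingLayer` and this `compute_kd_loss` are hypothetical stand-ins, not the modelopt implementation, and the example does not claim to identify the actual cause in this issue. It only shows that the loss expects one captured teacher output per call and raises the same `IndexError` when none is available.

```python
class CapturingLayer:
    """Hypothetical stand-in for a layer whose outputs are captured on forward."""

    def __init__(self, scale):
        self.scale = scale
        self._intermediate_output = []  # one entry appended per forward call

    def forward(self, x):
        out = [v * self.scale for v in x]
        self._intermediate_output.append(out)
        return out


def compute_kd_loss(student_layer, teacher_layer):
    """Toy MSE loss that consumes one captured output from each layer."""
    out_s = student_layer._intermediate_output.pop(0)
    out_t = teacher_layer._intermediate_output.pop(0)
    return sum((s - t) ** 2 for s, t in zip(out_s, out_t)) / len(out_s)


student, teacher = CapturingLayer(1.0), CapturingLayer(2.0)

# Normal case: both forwards ran once, so one loss call succeeds.
student.forward([1.0, 2.0])
teacher.forward([1.0, 2.0])
loss = compute_kd_loss(student, teacher)

# Failure case: the student forward ran, but the teacher forward did not,
# so the teacher's capture list is empty when the loss tries to pop from it.
student.forward([1.0, 2.0])
try:
    compute_kd_loss(student, teacher)
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(loss, error_message)
```

In this toy setup the error appears whenever the loss is called more often than the teacher forward has populated its list, which is one thing worth checking when debugging the real run.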

Steps/Code to reproduce bug

import nemo_run as run
from nemo.collections import llm
from nemo.collections.llm.modelopt.recipes import distillation_recipe
from nemo.lightning.pytorch.strategies.utils import RestoreConfig

# Set path(s) if different:
TEACHER_MODEL_PATH = "/home/jovyan/workspace-0/xxx-share-data/coding/ckpt/llama3/Llama-3.1-8B-nemo"
SEQ_LENGTH = 8192
DATA_PATH = "/home/jovyan/workspace-0/data-25/05-coding-pruning"
DATA_PATHS = {
    "train": [1.0, f"{DATA_PATH}/250511_coding_cplt_pt_sample_1w_wikitext_tokenized_train_text_document"],
    "validation": [f"{DATA_PATH}/250511_coding_cplt_pt_sample_1w_wikitext_tokenized_train_text_document"],
    "test": [f"{DATA_PATH}/250511_coding_cplt_pt_sample_1w_wikitext_tokenized_train_text_document"],
}
INDEX_MAPPING_DIR = f"{DATA_PATH}/index_mappings"

student_model_path="/home/jovyan/workspace-0/xxx-share-data/coding/ckpt/llama3/Llama-3.1-8B-nemo-depth-pruned-2"
exp_name="Llama-3.1-8B-nemo-ft-depth-distilled-2"
exp_dir=f"/home/jovyan/workspace-0/xxx-share-data/coding/ckpt/llama3/{exp_name}"

# Change these to accommodate resources:
DEVICES = 8
NODES = 1
TENSOR_PARALLEL_SIZE = 8
PIPELINE_PARALLEL_SIZE = 1
MICRO_BATCH_SIZE = 1

# Change the fine-tuning recipe for your model and dataset (below values for demonstration purposes):
STEPS = 100
GLOBAL_BATCH_SIZE = 128
LR = 1e-4
MIN_LR = 1e-5
WARMUP_STEPS = 2
LOG_INTERVAL = 1
VAL_INTERVAL = 10
NUM_VAL_BATCHES = 5


def configure_recipe(student_model_path, exp_dir, exp_name):
    # Define the recipe
    recipe = distillation_recipe(
        student_model_path=student_model_path,
        teacher_model_path=TEACHER_MODEL_PATH,
        name=exp_name,
        num_nodes=NODES,
        num_gpus_per_node=DEVICES,
    )
    recipe.resume.restore_config = run.Config(
        RestoreConfig,
        path=student_model_path,
    )
    recipe.log.explicit_log_dir = exp_dir
    recipe.log.ckpt.every_n_train_steps = VAL_INTERVAL
    del recipe.log.ckpt.train_time_interval

    # Change dataset (default is Squad dataset)
    recipe.data = run.Config(
        llm.PreTrainingDataModule,
        paths=DATA_PATHS,
        index_mapping_dir=INDEX_MAPPING_DIR,
        seq_length=SEQ_LENGTH,
        micro_batch_size=MICRO_BATCH_SIZE,
        global_batch_size=GLOBAL_BATCH_SIZE,
    )

    # Set the training parameters if you don't want to use the recipe defaults
    recipe.trainer.max_steps = STEPS
    recipe.trainer.log_every_n_steps = LOG_INTERVAL
    recipe.trainer.val_check_interval = VAL_INTERVAL
    recipe.trainer.limit_val_batches = NUM_VAL_BATCHES
    recipe.trainer.strategy.tensor_model_parallel_size = TENSOR_PARALLEL_SIZE
    recipe.trainer.strategy.pipeline_model_parallel_size = PIPELINE_PARALLEL_SIZE
    recipe.trainer.strategy.sequence_parallel = TENSOR_PARALLEL_SIZE > 1
    recipe.optim.config.lr = LR
    recipe.optim.lr_scheduler.warmup_steps = WARMUP_STEPS
    recipe.optim.lr_scheduler.min_lr = MIN_LR

    return recipe

recipe = configure_recipe(student_model_path, exp_dir, exp_name)
print(recipe)

env_vars = {
    "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",  # Disable caching NCCL communication buffer memory
    "NCCL_NVLS_ENABLE": "0",  # Disable NVLink SHARP to save memory
}
executor = run.LocalExecutor(ntasks_per_node=recipe.trainer.devices, launcher="torchrun", env_vars=env_vars)
run.run(recipe, executor=executor, name=exp_name)
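
As a sanity check on the parallelism settings in the script, the batch layout can be worked out by hand. The sketch below assumes the usual Megatron-style convention (data-parallel size = world size / (TP × PP), no context parallelism) and uses a hypothetical helper name, `batch_layout`; it is not a NeMo API.

```python
def batch_layout(world_size, tp, pp, micro_batch_size, global_batch_size):
    """Return (data_parallel_size, num_microbatches) for the given settings.

    Assumes data_parallel_size = world_size / (tp * pp) and no context
    parallelism; raises ValueError if the sizes do not divide evenly.
    """
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    dp = world_size // model_parallel

    samples_per_pass = micro_batch_size * dp
    if global_batch_size % samples_per_pass != 0:
        raise ValueError("global_batch_size must be divisible by micro_batch_size * dp")
    return dp, global_batch_size // samples_per_pass


# Values from the script above: 1 node x 8 GPUs, TP=8, PP=1, MBS=1, GBS=128.
dp, num_microbatches = batch_layout(
    world_size=8, tp=8, pp=1, micro_batch_size=1, global_batch_size=128
)
print(dp, num_microbatches)
```

Under these assumptions the run uses a single data-parallel replica and accumulates 128 microbatches per optimizer step.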

Environment overview (please complete the following information)

My environment was set up with:

apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
# nemo-toolkit
cd /home/jovyan/workspace-0/xxx-proj/NeMo-2505
pip install -e ".[all]"
# nemo_run
cd /home/jovyan/workspace-0/xxx-proj/NeMo-Run-2505
pip install -e .
# megatron.core
cd /home/jovyan/workspace-0/xxx-proj/Megatron-LM-2505
pip install -e .

Environment details

If NVIDIA docker image is used you don't need to specify these. Otherwise, please provide:

OS version:
PyTorch version: 2.7.0
Python version: 3.10.12

May 13 '25 13:05 mZhenz

Hi @mZhenz, can you share which NeMo container you are using? Is this from the 25.04 container with nvidia-modelopt==0.27.1?

May 14 '25 19:05 kevalmorabia97

root@pruning-master-0:~# pip list | grep modelopt
nvidia-modelopt               0.27.1
nvidia-modelopt-core          0.27.1

Yes. I set up my environment with the script below, using the latest main branch of NeMo/NeMo-Run/Megatron-LM.

pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0
apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
# nemo-toolkit
cd /home/jovyan/workspace-0/llama-proj/NeMo-2505
pip install -e ".[all]"
# nemo_run
cd /home/jovyan/workspace-0/llama-proj/NeMo-Run-2505
pip install -e .
# megatron.core
cd /home/jovyan/workspace-0/llama-proj/Megatron-LM-2505
pip install -e .
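
The versions asked about above can also be reported from inside the Python environment that actually runs the job, which avoids confusion when several environments are installed. This is a small sketch using only the standard library; the helper name `installed_versions` is made up for illustration, and the package names are the distribution names shown in the pip output.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(
    ["nvidia-modelopt", "nemo-toolkit", "megatron-core", "torch"]
)
for name, ver in report.items():
    print(f"{name:<20} {ver or 'not installed'}")
```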

Here is my full pip list.

Package                       Version         Editable project location
----------------------------- --------------- ------------------------------------------------------
absl-py                       2.1.0
accelerate                    1.2.1
accelerated-scan              0.2.0
addict                        2.4.0
aiofiles                      24.1.0
aiohttp                       3.9.5
aiosignal                     1.3.1
alabaster                     1.0.0
alembic                       1.15.2
aniso8601                     10.0.1
annotated-types               0.7.0
antlr4-python3-runtime        4.9.3
anyio                         3.7.1
apex                          0.1
APScheduler                   3.10.4
argon2-cffi                   23.1.0
argon2-cffi-bindings          21.2.0
asciitree                     0.3.3
asttokens                     2.4.1
astunparse                    1.6.3
async-timeout                 4.0.3
attrdict                      2.0.1
attrs                         23.2.0
audioread                     3.0.1
av                            14.3.0
babel                         2.16.0
backoff                       2.2.1
bcrypt                        4.3.0
beautifulsoup4                4.13.1
bitsandbytes                  0.45.3
black                         24.10.0
bleach                        6.1.0
blinker                       1.9.0
blis                          0.7.11
boto3                         1.38.16
botocore                      1.38.16
braceexpand                   0.1.7
Brotli                        1.1.0
cachetools                    5.3.3
catalogue                     2.0.10
cdifflib                      1.2.6
certifi                       2024.7.4
cffi                          1.16.0
chardet                       5.2.0
charset-normalizer            3.3.2
click                         8.2.0
clip                          0.2.0
cloudpathlib                  0.18.1
cloudpickle                   3.0.0
cmake                         3.30.0
colorama                      0.4.6
colorlog                      6.9.0
comm                          0.2.2
confection                    0.1.5
contourpy                     1.2.1
coverage                      7.8.0
cryptography                  42.0.8
cuda-python                   12.5.0
cudf                          24.4.0
cugraph                       24.4.0
cugraph-dgl                   24.4.0
cugraph-equivariant           24.4.0
cugraph-pyg                   24.4.0
cugraph-service-client        24.4.0
cugraph-service-server        24.4.0
cuml                          24.4.0
cupy-cuda12x                  13.0.0
cycler                        0.12.1
cymem                         2.0.8
Cython                        3.0.10
cytoolz                       1.0.1
dask                          2024.1.1
dask-cuda                     24.4.0
dask-cudf                     24.4.0
dask-expr                     0.4.0
dataclasses-json              0.6.7
DataProperty                  1.1.0
datasets                      3.6.0
debugpy                       1.8.2
decorator                     5.1.1
decord                        0.6.0
defusedxml                    0.7.1
Deprecated                    1.2.18
diffusers                     0.33.1
dill                          0.3.8
Distance                      0.1.3
distlib                       0.3.4
distributed                   2024.1.1
distro                        1.9.0
dm-tree                       0.1.8
docker                        7.1.0
docker-pycreds                0.4.0
docopt                        0.6.2
docstring_parser              0.16
docutils                      0.21.2
dropout-layer-norm            0.1
editdistance                  0.8.1
einops                        0.8.0
einops-exts                   0.0.4
emoji                         2.14.1
entrypoints                   0.4
evaluate                      0.4.3
exceptiongroup                1.2.1
execnet                       2.1.1
executing                     2.0.1
expecttest                    0.1.3
fabric                        3.2.2
faiss-cpu                     1.11.0
fastapi                       0.115.12
fasteners                     0.19
fastjsonschema                2.20.0
fastrlock                     0.8.2
fiddle                        0.3.0
filelock                      3.15.4
filetype                      1.2.0
flash-attn                    2.6.3
Flask                         3.1.1
Flask-RESTful                 0.3.10
fonttools                     4.53.1
frozenlist                    1.4.1
fsspec                        2024.12.0
ftfy                          6.3.1
future                        1.0.0
g2p-en                        2.1.0
gast                          0.6.0
gdown                         5.2.0
gevent                        25.5.1
geventhttpclient              2.0.2
gitdb                         4.0.12
GitPython                     3.1.44
google-auth                   2.32.0
google-auth-oauthlib          1.0.0
graphviz                      0.20.3
greenlet                      3.2.2
grouped-gemm                  1.1.2
grpcio                        1.72.0
grpcio-tools                  1.59.2
gviz-api                      1.10.0
h11                           0.16.0
h5py                          3.13.0
hawkeye-train                 0.1.5.2
httpcore                      1.0.9
httpx                         0.27.0
huggingface-hub               0.31.2
hydra-core                    1.3.2
hypothesis                    5.35.1
idna                          3.7
igraph                        0.11.6
ijson                         3.4.0
imageio                       2.37.0
imagesize                     1.4.1
immutabledict                 4.2.0
importlib_metadata            7.1.0
inflect                       7.5.0
iniconfig                     2.0.0
inquirerpy                    0.3.4
intel-openmp                  2021.4.0
intervaltree                  3.1.0
invoke                        2.2.0
ipykernel                     6.20.2
ipython                       8.21.0
ipython-genutils              0.2.0
isort                         5.13.2
itsdangerous                  2.2.0
Janome                        0.5.0
jedi                          0.19.1
jieba                         0.42.1
Jinja2                        3.1.4
jiter                         0.9.0
jiwer                         3.1.0
jmespath                      1.0.1
joblib                        1.4.2
json5                         0.9.25
jsonlines                     4.0.0
jsonschema                    4.23.0
jsonschema-specifications     2023.12.1
jupyter-client                7.1.2
jupyter-core                  4.9.2
jupyter-server                1.13.5
jupyterlab                    3.0.16
jupyterlab-pygments           0.1.2
jupyterlab-server             2.10.3
kaldi-python-io               1.2.2
kaldiio                       2.18.1
kiwisolver                    1.4.5
kornia                        0.8.1
kornia_rs                     0.1.9
kvikio                        24.4.0
langcodes                     3.4.0
langdetect                    1.0.9
language_data                 1.2.0
latexcodec                    3.0.0
lazy_loader                   0.4
leo2-client                   2.0.7
Levenshtein                   0.27.1
lhotse                        1.31.0
libcst                        1.7.0
librosa                       0.10.1
lightning                     2.4.0
lightning-thunder             0.2.0.dev0
lightning-utilities           0.11.3.post0
lilcom                        1.8.1
lintrunner                    0.12.5
llvmlite                      0.44.0
locket                        1.0.0
loguru                        0.7.3
looseversion                  1.3.0
lxml                          5.4.0
Mako                          1.3.10
marisa-trie                   1.2.0
Markdown                      3.6
markdown-it-py                3.0.0
markdown2                     2.5.3
MarkupSafe                    2.1.5
marshmallow                   3.26.1
matplotlib                    3.9.1
matplotlib-inline             0.1.7
mbstrdecoder                  1.1.4
mdit-py-plugins               0.4.1
mdurl                         0.1.2
mediapy                       1.1.6
megablocks                    0.4.0
megatron-core                 0.13.0rc0       /home/jovyan/workspace-0/llama-proj/Megatron-LM-2505
megatron-energon              5.2.0
mistune                       3.0.2
mkl                           2021.1.1
mkl-devel                     2021.1.1
mkl-include                   2021.1.1
ml_dtypes                     0.5.0
mock                          5.1.0
more-itertools                10.7.0
mpmath                        1.3.0
msgpack                       1.0.8
multi-storage-client          0.20.3
multidict                     6.0.5
multiprocess                  0.70.16
murmurhash                    1.0.10
mypy_extensions               1.1.0
nbclassic                     0.5.6
nbclient                      0.7.0
nbconvert                     7.16.4
nbformat                      5.10.3
nemo_run                      0.5.0rc0.dev0   /home/jovyan/workspace-0/llama-proj/NeMo-Run-2505
nemo_text_processing          1.1.0
nemo-toolkit                  2.4.0rc0        /home/jovyan/workspace-0/llama-proj/NeMo-2505
nerfacc                       0.5.3
nest-asyncio                  1.6.0
networkx                      3.3
ninja                         1.11.1.1
nltk                          3.9.1
notebook                      6.4.10
notebook_shim                 0.2.4
num2words                     0.5.14
numba                         0.61.0
numcodecs                     0.11.0
numexpr                       2.10.2
numpy                         1.26.4
nvfuser                       0.2.6a0+f73ff1b
nvidia-cublas-cu12            12.6.4.1
nvidia-cuda-cupti-cu12        12.6.80
nvidia-cuda-nvrtc-cu12        12.6.77
nvidia-cuda-runtime-cu12      12.6.77
nvidia-cudnn-cu12             9.5.1.17
nvidia-cudnn-frontend         1.5.1
nvidia-cufft-cu12             11.3.0.4
nvidia-cufile-cu12            1.11.1.6
nvidia-curand-cu12            10.3.7.77
nvidia-cusolver-cu12          11.7.1.2
nvidia-cusparse-cu12          12.5.4.2
nvidia-cusparselt-cu12        0.6.3
nvidia-dali-cuda120           1.39.0
nvidia_lm_eval                25.4.1
nvidia-ml-py                  12.575.51
nvidia-modelopt               0.27.1
nvidia-modelopt-core          0.27.1
nvidia-nccl-cu12              2.26.2
nvidia-nvimgcodec-cu12        0.2.0.7
nvidia-nvjitlink-cu12         12.6.85
nvidia-nvtx-cu12              12.6.77
nvidia-pyindex                1.0.9
nvidia-pytriton               0.5.14
nvidia-resiliency-ext         0.3.0
nvtx                          0.2.5
nx-cugraph                    24.4.0
oauthlib                      3.2.2
omegaconf                     2.3.0
onnx                          1.16.0
open-clip-torch               2.24.0
openai                        1.61.0
OpenCC                        1.1.9
opencv-python                 4.10.0.84
opentelemetry-api             1.33.0
opt-einsum                    3.3.0
optree                        0.12.1
optuna                        4.3.0
packaging                     24.0
pandas                        2.2.1
pandocfilters                 1.5.1
pangu                         4.0.6.1
parameterized                 0.9.0
paramiko                      3.5.1
parso                         0.8.4
partd                         1.4.2
pathspec                      0.12.1
pathvalidate                  3.2.3
peft                          0.15.2
pesq                          0.0.4
pexpect                       4.9.0
pfzy                          0.3.4
pillow                        10.4.0
pip                           24.1.2
pipenv                        11.9.0
plac                          1.4.5
platformdirs                  4.2.2
pluggy                        1.5.0
ply                           3.11
polygraphy                    0.49.12
pooch                         1.8.2
portalocker                   3.1.1
preshed                       3.0.9
prettytable                   3.16.0
progress                      1.6
prometheus_client             0.20.0
prompt_toolkit                3.0.47
protobuf                      4.24.4
psutil                        7.0.0
ptyprocess                    0.7.0
PuLP                          3.1.1
pure-eval                     0.2.2
pyannote.core                 5.0.0
pyannote.database             5.1.3
pyannote.metrics              3.2.1
pyarrow                       18.1.0
pyarrow-hotfix                0.6
pyasn1                        0.6.0
pyasn1_modules                0.4.0
pyavi                         0.0.24
pybind11                      2.13.1
pybind11_global               2.13.1
pybtex                        0.24.0
pybtex-docutils               1.0.3
pycatbundle                   3.1.8.1
pycocotools                   2.0+nv0.8.0
pycparser                     2.22
pycryptodome                  3.19.0
pycryptodomex                 3.19.0
pydantic                      2.11.4
pydantic_core                 2.33.2
pydantic-settings             2.9.1
pydub                         0.25.1
Pygments                      2.18.0
pyleo                         1.1.0
pylibcugraph                  24.4.0
pylibcugraphops               24.4.0
pylibraft                     24.4.0
pylibwholegraph               24.4.0
pyloudnorm                    0.1.1
PyNaCl                        1.5.0
pynini                        2.1.6.post1
pynvjitlink                   0.2.3
pynvml                        12.0.0
pyparsing                     3.1.2
pypdf                         5.5.0
pypinyin                      0.54.0
pypinyin-dict                 0.9.0
pyre-extensions               0.0.32
PySocks                       1.7.1
pystoi                        0.4.1
pytablewriter                 1.2.1
pytest                        8.1.1
pytest-cov                    6.1.1
pytest-flakefinder            1.1.0
pytest-mock                   3.14.0
pytest-random-order           1.1.1
pytest-rerunfailures          14.0
pytest-runner                 6.0.1
pytest-shard                  0.1.2
pytest-xdist                  3.6.1
python-dateutil               2.9.0.post0
python-dotenv                 1.1.0
python-hostlist               1.23.0
python-iso639                 2025.2.18
python-magic                  0.4.27
python-rapidjson              1.20
pytorch-lightning             2.5.1.post0
pytorch-triton                3.0.0+989adb9a2
pytz                          2024.1
PyYAML                        6.0.1
pyzmq                         26.0.3
qwen-vl-utils                 0.0.11
raft-dask                     24.4.0
RapidFuzz                     3.13.0
rapids-dask-dependency        24.4.0a0
referencing                   0.35.1
regex                         2024.5.15
requests                      2.32.3
requests-oauthlib             2.0.0
requests-toolbelt             1.0.0
resampy                       0.4.3
rich                          13.7.1
rmm                           24.4.0
rouge-score                   0.1.2
rpds-py                       0.19.0
rsa                           4.9
ruamel.yaml                   0.18.10
ruamel.yaml.clib              0.2.12
s3fs                          0.4.2
s3transfer                    0.12.0
sacrebleu                     2.5.1
sacremoses                    0.1.1
safetensors                   0.4.5
scikit-learn                  1.5.1
scipy                         1.13.1
seaborn                       0.13.2
Send2Trash                    1.8.3
sentence-transformers         4.1.0
sentencepiece                 0.2.0
sentry-sdk                    2.28.0
setproctitle                  1.3.6
setuptools                    80.7.1
sh                            2.2.2
shellingham                   1.5.4
six                           1.16.0
smart-open                    7.0.4
smmap                         5.0.2
sniffio                       1.3.1
snowballstemmer               3.0.1
sortedcontainers              2.4.0
soundfile                     0.12.1
soupsieve                     2.5
sox                           1.5.0
soxr                          0.3.7
spacy                         3.7.5
spacy-legacy                  3.0.12
spacy-loggers                 1.0.5
Sphinx                        8.1.3
sphinxcontrib-applehelp       2.0.0
sphinxcontrib-bibtex          2.6.3
sphinxcontrib-devhelp         2.0.0
sphinxcontrib-htmlhelp        2.1.0
sphinxcontrib-jsmath          1.0.1
sphinxcontrib-qthelp          2.0.0
sphinxcontrib-serializinghtml 2.0.0
SQLAlchemy                    2.0.41
srsly                         2.4.8
stack-data                    0.6.3
stanford-stk                  0.7.0
starlette                     0.46.2
sympy                         1.14.0
tabledata                     1.3.4
tabulate                      0.9.0
taming-transformers           0.0.1
tbb                           2021.13.0
tblib                         3.0.0
tcolorpy                      0.1.7
tenacity                      9.1.2
tensorboard                   2.14.0
tensorboard-data-server       0.7.2
tensorboard-plugin-profile    2.14.0
tensorboard-plugin-wit        1.8.1
tensorrt                      10.2.0
tensorstore                   0.1.71
termcolor                     3.1.0
terminado                     0.18.1
text-unidecode                1.3
textdistance                  4.6.3
texterrors                    0.5.1
texttable                     1.7.0
thinc                         8.2.5
threadpoolctl                 3.5.0
thriftpy2                     0.5.0
tiktoken                      0.7.0
timm                          1.0.15
tinycss2                      1.3.0
tokenizers                    0.21.0
toml                          0.10.2
tomli                         2.0.1
toolz                         0.12.1
torch                         2.7.0
torch-tensorrt                2.5.0a0
torchaudio                    2.2.0
torchdiffeq                   0.2.5
torchmetrics                  1.7.1
torchprofile                  0.0.4
torchsde                      0.2.6
torchvision                   0.22.0
torchx                        0.7.0
tornado                       6.4
tqdm                          4.66.4
tqdm-multiprocess             0.0.11
traitlets                     5.9.0
trampoline                    0.1.2
transformer-engine            1.9.0+43d0d17
transformers                  4.51.3
treelite                      4.1.2
trimesh                       4.6.9
triton                        3.3.0
tritonclient                  2.51.0
typeguard                     4.4.2
typepy                        1.3.4
typer                         0.12.3
types-dataclasses             0.6.6
typing_extensions             4.13.2
typing-inspect                0.9.0
typing-inspection             0.4.0
tzdata                        2024.1
tzlocal                       5.2
ucx-py                        0.37.0
unstructured                  0.14.9
unstructured-client           0.35.0
urllib3                       1.26.20
uvicorn                       0.34.2
virtualenv                    20.13.0+ds
virtualenv-clone              0.3.0
wandb                         0.19.11
wasabi                        1.1.3
wcwidth                       0.2.13
weasel                        0.4.1
webdataset                    0.2.111
webencodings                  0.5.1
websocket-client              1.8.0
Werkzeug                      3.1.3
wget                          3.2
wheel                         0.43.0
word2number                   1.1
wrapt                         1.16.0
xdoctest                      1.0.2
xgboost                       2.0.3
xxhash                        3.5.0
yarl                          1.9.4
zarr                          2.18.2
zict                          3.0.0
zipp                          3.19.0
zope.event                    5.0
zope.interface                7.2
zstandard                     0.23.0

May 15 '25 12:05 mZhenz

Could you try the container without installing anything else inside?

NeMo and Megatron-LM should already be on the PYTHONPATH, at /opt/NeMo and /opt/megatron-lm respectively.
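
A quick way to check which copies actually get picked up is to ask Python where it would import each package from. This is only a rough sketch: `module_origin` is a helper name made up here, and the top-level module names `nemo`, `megatron`, and `nemo_run` are assumed from the paths in this thread. Run it inside the container with the same interpreter that launches the job.

```python
import importlib.util


def module_origin(name):
    """Return the path a top-level module would be imported from, or None if not found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None:
        return None
    if spec.origin:
        return spec.origin
    # Namespace packages have no origin; fall back to their first search location.
    locations = spec.submodule_search_locations
    return next(iter(locations), None) if locations else None


# Assumed module names; adjust if your install uses different ones.
for name in ("nemo", "megatron", "nemo_run"):
    print(f"{name}: {module_origin(name)}")
```

If `nemo` or `megatron` resolves to somewhere other than /opt/NeMo or /opt/megatron-lm (for example a `dist-packages` directory or a checkout under your home directory), then a separately installed copy is shadowing the one shipped in the container.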

May 15 '25 13:05 AAnoosheh