[Llama3 Model Distillation] IndexError: pop from empty list
**Describe the bug**

Following https://github.com/NVIDIA/NeMo/tree/main/tutorials/llm/llama/pruning-distillation to prune and distill Llama 3.1 8B. When running 04_distillation, I hit the following error:
```
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/jovyan/workspace-0/xxx-proj/NeMo-Run-2505/nemo_run/core/runners/fdl_runner.py", line 72, in <module>
    fdl_runner_app()
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 326, in __call__
    raise e
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 309, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1442, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 661, in main
    return _main(
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 193, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1226, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 794, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 692, in wrapper
    return callback(**use_params)
  File "/home/jovyan/workspace-0/xxx-proj/NeMo-Run-2505/nemo_run/core/runners/fdl_runner.py", line 68, in fdl_direct_run
    fdl_fn()
  File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/collections/llm/api.py", line 434, in distill
    return train(
  File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/collections/llm/api.py", line 127, in train
    trainer.fit(model, data)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 46, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
    self.fit_loop.run()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/lightning/pytorch/trainer.py", line 47, in advance
    super().advance(data_fetcher)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/training_epoch_loop.py", line 250, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 190, in run
    self._optimizer_step(batch_idx, closure)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 268, in _optimizer_step
    call._call_lightning_module_hook(
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 167, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/core/module.py", line 1306, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/core/optimizer.py", line 153, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
  File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/lightning/pytorch/strategies/megatron_strategy.py", line 779, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/ddp.py", line 270, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/strategies/strategy.py", line 238, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/plugins/precision/precision.py", line 122, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 124, in wrapper
    return func.__get__(opt, opt.__class__)(*args, **kwargs)
  File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/core/optim/mcore_optim.py", line 129, in step
    loss = closure()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/plugins/precision/precision.py", line 108, in _wrap_closure
    closure_result = closure()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 144, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 129, in closure
    step_output = self._step_fn()
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/loops/optimization/automatic.py", line 317, in _training_step
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/usr/local/lib/python3.10/dist-packages/lightning/pytorch/trainer/call.py", line 319, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/lightning/pytorch/strategies/megatron_strategy.py", line 713, in training_step
    out = self.model.training_step(dataloader_iter, *args, **kwargs)
  File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/lightning/megatron_parallel.py", line 389, in training_step
    return self._step(
  File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/lightning/megatron_parallel.py", line 501, in _step
    return self.forward(
  File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/lightning/megatron_parallel.py", line 351, in forward
    microbatch_outputs = step()
  File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/lightning/megatron_parallel.py", line 1287, in __call__
    return self.forward_backward_func(
  File "/home/jovyan/workspace-0/xxx-proj/Megatron-LM-2505/megatron/core/pipeline_parallel/schedules.py", line 500, in forward_backward_no_pipelining
    output_tensor, num_tokens = forward_step(
  File "/home/jovyan/workspace-0/xxx-proj/Megatron-LM-2505/megatron/core/pipeline_parallel/schedules.py", line 303, in forward_step
    outputs = loss_func(output_tensor)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1857, in _call_impl
    return inner()
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1805, in inner
    result = forward_call(*args, **kwargs)
  File "/home/jovyan/workspace-0/xxx-proj/NeMo-2505/nemo/collections/llm/modelopt/distill/model.py", line 97, in forward
    loss_for_ub = self._distillation_loss_fn(
  File "/usr/local/lib/python3.10/dist-packages/modelopt/torch/distill/distillation_model.py", line 271, in compute_kd_loss
    out_t = teacher_layer._intermediate_output.pop(0)  # can store multiple in special cases
IndexError: pop from empty list
```
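For context on the failure mode: judging from the last frames, `compute_kd_loss` pops a previously captured teacher activation from `teacher_layer._intermediate_output`, so an empty list suggests the teacher forward pass (or the hook that records its intermediate output) did not run for this microbatch. Below is a minimal illustrative sketch of that capture-then-pop pattern (not ModelOpt's actual implementation) showing how a missing producer forward yields the same `IndexError`:

```python
# Illustrative only: a forward hook appends each layer output to a buffer,
# and the loss side pops one entry per call, i.e. the pattern in the traceback.
import torch
import torch.nn as nn

teacher_layer = nn.Linear(4, 4)
teacher_layer._intermediate_output = []  # buffer filled by the hook below

def capture(module, inputs, output):
    module._intermediate_output.append(output)

teacher_layer.register_forward_hook(capture)

x = torch.randn(2, 4)
teacher_layer(x)                                   # hook fires, buffer holds one entry
out_t = teacher_layer._intermediate_output.pop(0)  # fine: one entry available

# A second loss computation without a matching teacher forward hits the bug:
try:
    teacher_layer._intermediate_output.pop(0)
except IndexError as e:
    print(f"IndexError: {e}")  # "pop from empty list"
```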
**Steps/Code to reproduce bug**
```python
import nemo_run as run
from nemo.collections import llm
from nemo.collections.llm.modelopt.recipes import distillation_recipe
from nemo.lightning.pytorch.strategies.utils import RestoreConfig

# Set path(s) if different:
TEACHER_MODEL_PATH = "/home/jovyan/workspace-0/xxx-share-data/coding/ckpt/llama3/Llama-3.1-8B-nemo"
SEQ_LENGTH = 8192

DATA_PATH = "/home/jovyan/workspace-0/data-25/05-coding-pruning"
DATA_PATHS = {
    "train": [1.0, f"{DATA_PATH}/250511_coding_cplt_pt_sample_1w_wikitext_tokenized_train_text_document"],
    "validation": [f"{DATA_PATH}/250511_coding_cplt_pt_sample_1w_wikitext_tokenized_train_text_document"],
    "test": [f"{DATA_PATH}/250511_coding_cplt_pt_sample_1w_wikitext_tokenized_train_text_document"],
}
INDEX_MAPPING_DIR = f"{DATA_PATH}/index_mappings"

student_model_path = "/home/jovyan/workspace-0/xxx-share-data/coding/ckpt/llama3/Llama-3.1-8B-nemo-depth-pruned-2"
exp_name = "Llama-3.1-8B-nemo-ft-depth-distilled-2"
exp_dir = f"/home/jovyan/workspace-0/xxx-share-data/coding/ckpt/llama3/{exp_name}"

# Change these to accommodate resources:
DEVICES = 8
NODES = 1
TENSOR_PARALLEL_SIZE = 8
PIPELINE_PARALLEL_SIZE = 1
MICRO_BATCH_SIZE = 1

# Change the fine-tuning recipe for your model and dataset (below values for demonstration purposes):
STEPS = 100
GLOBAL_BATCH_SIZE = 128
LR = 1e-4
MIN_LR = 1e-5
WARMUP_STEPS = 2
LOG_INTERVAL = 1
VAL_INTERVAL = 10
NUM_VAL_BATCHES = 5


def configure_recipe(student_model_path, exp_dir, exp_name):
    # Define the recipe
    recipe = distillation_recipe(
        student_model_path=student_model_path,
        teacher_model_path=TEACHER_MODEL_PATH,
        name=exp_name,
        num_nodes=NODES,
        num_gpus_per_node=DEVICES,
    )
    recipe.resume.restore_config = run.Config(
        RestoreConfig,
        path=student_model_path,
    )
    recipe.log.explicit_log_dir = exp_dir
    recipe.log.ckpt.every_n_train_steps = VAL_INTERVAL
    del recipe.log.ckpt.train_time_interval

    # Change dataset (default is the SQuAD dataset)
    recipe.data = run.Config(
        llm.PreTrainingDataModule,
        paths=DATA_PATHS,
        index_mapping_dir=INDEX_MAPPING_DIR,
        seq_length=SEQ_LENGTH,
        micro_batch_size=MICRO_BATCH_SIZE,
        global_batch_size=GLOBAL_BATCH_SIZE,
    )

    # Set the training parameters if you don't want to use the recipe defaults
    recipe.trainer.max_steps = STEPS
    recipe.trainer.log_every_n_steps = LOG_INTERVAL
    recipe.trainer.val_check_interval = VAL_INTERVAL
    recipe.trainer.limit_val_batches = NUM_VAL_BATCHES
    recipe.trainer.strategy.tensor_model_parallel_size = TENSOR_PARALLEL_SIZE
    recipe.trainer.strategy.pipeline_model_parallel_size = PIPELINE_PARALLEL_SIZE
    recipe.trainer.strategy.sequence_parallel = TENSOR_PARALLEL_SIZE > 1
    recipe.optim.config.lr = LR
    recipe.optim.lr_scheduler.warmup_steps = WARMUP_STEPS
    recipe.optim.lr_scheduler.min_lr = MIN_LR
    return recipe


recipe = configure_recipe(student_model_path, exp_dir, exp_name)
print(recipe)

env_vars = {
    "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",  # Disable caching NCCL communication buffer memory
    "NCCL_NVLS_ENABLE": "0",  # Disable NVLink SHARP to save memory
}

executor = run.LocalExecutor(ntasks_per_node=recipe.trainer.devices, launcher="torchrun", env_vars=env_vars)
run.run(recipe, executor=executor, name=exp_name)
```
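Before launching, it may be worth ruling out a trivially missing checkpoint, since the teacher path is baked into the recipe. A hypothetical pre-flight check (not part of the tutorial) reusing the variables from the snippet above:

```python
# Hypothetical sanity check before run.run(): verify both checkpoint paths resolve.
from pathlib import Path

for label, path in {"teacher": TEACHER_MODEL_PATH, "student": student_model_path}.items():
    if not Path(path).exists():
        raise FileNotFoundError(f"{label} checkpoint not found: {path}")
```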
**Environment overview**

My environment was set up with:
```bash
apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
# nemo-toolkit
cd /home/jovyan/workspace-0/xxx-proj/NeMo-2505
pip install -e ".[all]"
# nemo_run
cd /home/jovyan/workspace-0/xxx-proj/NeMo-Run-2505
pip install -e .
# megatron.core
cd /home/jovyan/workspace-0/xxx-proj/Megatron-LM-2505
pip install -e .
```
**Environment details**
- PyTorch version: 2.7.0
- Python version: 3.10.12
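The remaining versions can be captured in one shot; a small sketch using `importlib.metadata` (distribution names as they appear in the pip list below):

```python
# Print the versions most relevant to this report.
from importlib.metadata import version

for pkg in ("torch", "lightning", "nemo-toolkit", "megatron-core", "nvidia-modelopt"):
    print(pkg, version(pkg))
```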
Hi @mZhenz, can you share which NeMo container you are using? Is this from the 25.04 container with nvidia-modelopt==0.27.1?
```
root@pruning-master-0:~# pip list | grep modelopt
nvidia-modelopt 0.27.1
nvidia-modelopt-core 0.27.1
```
Yes. I set up my environment with these scripts, using the latest main branches of NeMo/NeMo-Run/Megatron-LM.
```bash
pip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0
apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
# nemo-toolkit
cd /home/jovyan/workspace-0/llama-proj/NeMo-2505
pip install -e ".[all]"
# nemo_run
cd /home/jovyan/workspace-0/llama-proj/NeMo-Run-2505
pip install -e .
# megatron.core
cd /home/jovyan/workspace-0/llama-proj/Megatron-LM-2505
pip install -e .
```
Here is my full pip list.
```
Package Version Editable project location
----------------------------- --------------- ------------------------------------------------------
absl-py 2.1.0
accelerate 1.2.1
accelerated-scan 0.2.0
addict 2.4.0
aiofiles 24.1.0
aiohttp 3.9.5
aiosignal 1.3.1
alabaster 1.0.0
alembic 1.15.2
aniso8601 10.0.1
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 3.7.1
apex 0.1
APScheduler 3.10.4
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
asciitree 0.3.3
asttokens 2.4.1
astunparse 1.6.3
async-timeout 4.0.3
attrdict 2.0.1
attrs 23.2.0
audioread 3.0.1
av 14.3.0
babel 2.16.0
backoff 2.2.1
bcrypt 4.3.0
beautifulsoup4 4.13.1
bitsandbytes 0.45.3
black 24.10.0
bleach 6.1.0
blinker 1.9.0
blis 0.7.11
boto3 1.38.16
botocore 1.38.16
braceexpand 0.1.7
Brotli 1.1.0
cachetools 5.3.3
catalogue 2.0.10
cdifflib 1.2.6
certifi 2024.7.4
cffi 1.16.0
chardet 5.2.0
charset-normalizer 3.3.2
click 8.2.0
clip 0.2.0
cloudpathlib 0.18.1
cloudpickle 3.0.0
cmake 3.30.0
colorama 0.4.6
colorlog 6.9.0
comm 0.2.2
confection 0.1.5
contourpy 1.2.1
coverage 7.8.0
cryptography 42.0.8
cuda-python 12.5.0
cudf 24.4.0
cugraph 24.4.0
cugraph-dgl 24.4.0
cugraph-equivariant 24.4.0
cugraph-pyg 24.4.0
cugraph-service-client 24.4.0
cugraph-service-server 24.4.0
cuml 24.4.0
cupy-cuda12x 13.0.0
cycler 0.12.1
cymem 2.0.8
Cython 3.0.10
cytoolz 1.0.1
dask 2024.1.1
dask-cuda 24.4.0
dask-cudf 24.4.0
dask-expr 0.4.0
dataclasses-json 0.6.7
DataProperty 1.1.0
datasets 3.6.0
debugpy 1.8.2
decorator 5.1.1
decord 0.6.0
defusedxml 0.7.1
Deprecated 1.2.18
diffusers 0.33.1
dill 0.3.8
Distance 0.1.3
distlib 0.3.4
distributed 2024.1.1
distro 1.9.0
dm-tree 0.1.8
docker 7.1.0
docker-pycreds 0.4.0
docopt 0.6.2
docstring_parser 0.16
docutils 0.21.2
dropout-layer-norm 0.1
editdistance 0.8.1
einops 0.8.0
einops-exts 0.0.4
emoji 2.14.1
entrypoints 0.4
evaluate 0.4.3
exceptiongroup 1.2.1
execnet 2.1.1
executing 2.0.1
expecttest 0.1.3
fabric 3.2.2
faiss-cpu 1.11.0
fastapi 0.115.12
fasteners 0.19
fastjsonschema 2.20.0
fastrlock 0.8.2
fiddle 0.3.0
filelock 3.15.4
filetype 1.2.0
flash-attn 2.6.3
Flask 3.1.1
Flask-RESTful 0.3.10
fonttools 4.53.1
frozenlist 1.4.1
fsspec 2024.12.0
ftfy 6.3.1
future 1.0.0
g2p-en 2.1.0
gast 0.6.0
gdown 5.2.0
gevent 25.5.1
geventhttpclient 2.0.2
gitdb 4.0.12
GitPython 3.1.44
google-auth 2.32.0
google-auth-oauthlib 1.0.0
graphviz 0.20.3
greenlet 3.2.2
grouped-gemm 1.1.2
grpcio 1.72.0
grpcio-tools 1.59.2
gviz-api 1.10.0
h11 0.16.0
h5py 3.13.0
hawkeye-train 0.1.5.2
httpcore 1.0.9
httpx 0.27.0
huggingface-hub 0.31.2
hydra-core 1.3.2
hypothesis 5.35.1
idna 3.7
igraph 0.11.6
ijson 3.4.0
imageio 2.37.0
imagesize 1.4.1
immutabledict 4.2.0
importlib_metadata 7.1.0
inflect 7.5.0
iniconfig 2.0.0
inquirerpy 0.3.4
intel-openmp 2021.4.0
intervaltree 3.1.0
invoke 2.2.0
ipykernel 6.20.2
ipython 8.21.0
ipython-genutils 0.2.0
isort 5.13.2
itsdangerous 2.2.0
Janome 0.5.0
jedi 0.19.1
jieba 0.42.1
Jinja2 3.1.4
jiter 0.9.0
jiwer 3.1.0
jmespath 1.0.1
joblib 1.4.2
json5 0.9.25
jsonlines 4.0.0
jsonschema 4.23.0
jsonschema-specifications 2023.12.1
jupyter-client 7.1.2
jupyter-core 4.9.2
jupyter-server 1.13.5
jupyterlab 3.0.16
jupyterlab-pygments 0.1.2
jupyterlab-server 2.10.3
kaldi-python-io 1.2.2
kaldiio 2.18.1
kiwisolver 1.4.5
kornia 0.8.1
kornia_rs 0.1.9
kvikio 24.4.0
langcodes 3.4.0
langdetect 1.0.9
language_data 1.2.0
latexcodec 3.0.0
lazy_loader 0.4
leo2-client 2.0.7
Levenshtein 0.27.1
lhotse 1.31.0
libcst 1.7.0
librosa 0.10.1
lightning 2.4.0
lightning-thunder 0.2.0.dev0
lightning-utilities 0.11.3.post0
lilcom 1.8.1
lintrunner 0.12.5
llvmlite 0.44.0
locket 1.0.0
loguru 0.7.3
looseversion 1.3.0
lxml 5.4.0
Mako 1.3.10
marisa-trie 1.2.0
Markdown 3.6
markdown-it-py 3.0.0
markdown2 2.5.3
MarkupSafe 2.1.5
marshmallow 3.26.1
matplotlib 3.9.1
matplotlib-inline 0.1.7
mbstrdecoder 1.1.4
mdit-py-plugins 0.4.1
mdurl 0.1.2
mediapy 1.1.6
megablocks 0.4.0
megatron-core 0.13.0rc0 /home/jovyan/workspace-0/llama-proj/Megatron-LM-2505
megatron-energon 5.2.0
mistune 3.0.2
mkl 2021.1.1
mkl-devel 2021.1.1
mkl-include 2021.1.1
ml_dtypes 0.5.0
mock 5.1.0
more-itertools 10.7.0
mpmath 1.3.0
msgpack 1.0.8
multi-storage-client 0.20.3
multidict 6.0.5
multiprocess 0.70.16
murmurhash 1.0.10
mypy_extensions 1.1.0
nbclassic 0.5.6
nbclient 0.7.0
nbconvert 7.16.4
nbformat 5.10.3
nemo_run 0.5.0rc0.dev0 /home/jovyan/workspace-0/llama-proj/NeMo-Run-2505
nemo_text_processing 1.1.0
nemo-toolkit 2.4.0rc0 /home/jovyan/workspace-0/llama-proj/NeMo-2505
nerfacc 0.5.3
nest-asyncio 1.6.0
networkx 3.3
ninja 1.11.1.1
nltk 3.9.1
notebook 6.4.10
notebook_shim 0.2.4
num2words 0.5.14
numba 0.61.0
numcodecs 0.11.0
numexpr 2.10.2
numpy 1.26.4
nvfuser 0.2.6a0+f73ff1b
nvidia-cublas-cu12 12.6.4.1
nvidia-cuda-cupti-cu12 12.6.80
nvidia-cuda-nvrtc-cu12 12.6.77
nvidia-cuda-runtime-cu12 12.6.77
nvidia-cudnn-cu12 9.5.1.17
nvidia-cudnn-frontend 1.5.1
nvidia-cufft-cu12 11.3.0.4
nvidia-cufile-cu12 1.11.1.6
nvidia-curand-cu12 10.3.7.77
nvidia-cusolver-cu12 11.7.1.2
nvidia-cusparse-cu12 12.5.4.2
nvidia-cusparselt-cu12 0.6.3
nvidia-dali-cuda120 1.39.0
nvidia_lm_eval 25.4.1
nvidia-ml-py 12.575.51
nvidia-modelopt 0.27.1
nvidia-modelopt-core 0.27.1
nvidia-nccl-cu12 2.26.2
nvidia-nvimgcodec-cu12 0.2.0.7
nvidia-nvjitlink-cu12 12.6.85
nvidia-nvtx-cu12 12.6.77
nvidia-pyindex 1.0.9
nvidia-pytriton 0.5.14
nvidia-resiliency-ext 0.3.0
nvtx 0.2.5
nx-cugraph 24.4.0
oauthlib 3.2.2
omegaconf 2.3.0
onnx 1.16.0
open-clip-torch 2.24.0
openai 1.61.0
OpenCC 1.1.9
opencv-python 4.10.0.84
opentelemetry-api 1.33.0
opt-einsum 3.3.0
optree 0.12.1
optuna 4.3.0
packaging 24.0
pandas 2.2.1
pandocfilters 1.5.1
pangu 4.0.6.1
parameterized 0.9.0
paramiko 3.5.1
parso 0.8.4
partd 1.4.2
pathspec 0.12.1
pathvalidate 3.2.3
peft 0.15.2
pesq 0.0.4
pexpect 4.9.0
pfzy 0.3.4
pillow 10.4.0
pip 24.1.2
pipenv 11.9.0
plac 1.4.5
platformdirs 4.2.2
pluggy 1.5.0
ply 3.11
polygraphy 0.49.12
pooch 1.8.2
portalocker 3.1.1
preshed 3.0.9
prettytable 3.16.0
progress 1.6
prometheus_client 0.20.0
prompt_toolkit 3.0.47
protobuf 4.24.4
psutil 7.0.0
ptyprocess 0.7.0
PuLP 3.1.1
pure-eval 0.2.2
pyannote.core 5.0.0
pyannote.database 5.1.3
pyannote.metrics 3.2.1
pyarrow 18.1.0
pyarrow-hotfix 0.6
pyasn1 0.6.0
pyasn1_modules 0.4.0
pyavi 0.0.24
pybind11 2.13.1
pybind11_global 2.13.1
pybtex 0.24.0
pybtex-docutils 1.0.3
pycatbundle 3.1.8.1
pycocotools 2.0+nv0.8.0
pycparser 2.22
pycryptodome 3.19.0
pycryptodomex 3.19.0
pydantic 2.11.4
pydantic_core 2.33.2
pydantic-settings 2.9.1
pydub 0.25.1
Pygments 2.18.0
pyleo 1.1.0
pylibcugraph 24.4.0
pylibcugraphops 24.4.0
pylibraft 24.4.0
pylibwholegraph 24.4.0
pyloudnorm 0.1.1
PyNaCl 1.5.0
pynini 2.1.6.post1
pynvjitlink 0.2.3
pynvml 12.0.0
pyparsing 3.1.2
pypdf 5.5.0
pypinyin 0.54.0
pypinyin-dict 0.9.0
pyre-extensions 0.0.32
PySocks 1.7.1
pystoi 0.4.1
pytablewriter 1.2.1
pytest 8.1.1
pytest-cov 6.1.1
pytest-flakefinder 1.1.0
pytest-mock 3.14.0
pytest-random-order 1.1.1
pytest-rerunfailures 14.0
pytest-runner 6.0.1
pytest-shard 0.1.2
pytest-xdist 3.6.1
python-dateutil 2.9.0.post0
python-dotenv 1.1.0
python-hostlist 1.23.0
python-iso639 2025.2.18
python-magic 0.4.27
python-rapidjson 1.20
pytorch-lightning 2.5.1.post0
pytorch-triton 3.0.0+989adb9a2
pytz 2024.1
PyYAML 6.0.1
pyzmq 26.0.3
qwen-vl-utils 0.0.11
raft-dask 24.4.0
RapidFuzz 3.13.0
rapids-dask-dependency 24.4.0a0
referencing 0.35.1
regex 2024.5.15
requests 2.32.3
requests-oauthlib 2.0.0
requests-toolbelt 1.0.0
resampy 0.4.3
rich 13.7.1
rmm 24.4.0
rouge-score 0.1.2
rpds-py 0.19.0
rsa 4.9
ruamel.yaml 0.18.10
ruamel.yaml.clib 0.2.12
s3fs 0.4.2
s3transfer 0.12.0
sacrebleu 2.5.1
sacremoses 0.1.1
safetensors 0.4.5
scikit-learn 1.5.1
scipy 1.13.1
seaborn 0.13.2
Send2Trash 1.8.3
sentence-transformers 4.1.0
sentencepiece 0.2.0
sentry-sdk 2.28.0
setproctitle 1.3.6
setuptools 80.7.1
sh 2.2.2
shellingham 1.5.4
six 1.16.0
smart-open 7.0.4
smmap 5.0.2
sniffio 1.3.1
snowballstemmer 3.0.1
sortedcontainers 2.4.0
soundfile 0.12.1
soupsieve 2.5
sox 1.5.0
soxr 0.3.7
spacy 3.7.5
spacy-legacy 3.0.12
spacy-loggers 1.0.5
Sphinx 8.1.3
sphinxcontrib-applehelp 2.0.0
sphinxcontrib-bibtex 2.6.3
sphinxcontrib-devhelp 2.0.0
sphinxcontrib-htmlhelp 2.1.0
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 2.0.0
sphinxcontrib-serializinghtml 2.0.0
SQLAlchemy 2.0.41
srsly 2.4.8
stack-data 0.6.3
stanford-stk 0.7.0
starlette 0.46.2
sympy 1.14.0
tabledata 1.3.4
tabulate 0.9.0
taming-transformers 0.0.1
tbb 2021.13.0
tblib 3.0.0
tcolorpy 0.1.7
tenacity 9.1.2
tensorboard 2.14.0
tensorboard-data-server 0.7.2
tensorboard-plugin-profile 2.14.0
tensorboard-plugin-wit 1.8.1
tensorrt 10.2.0
tensorstore 0.1.71
termcolor 3.1.0
terminado 0.18.1
text-unidecode 1.3
textdistance 4.6.3
texterrors 0.5.1
texttable 1.7.0
thinc 8.2.5
threadpoolctl 3.5.0
thriftpy2 0.5.0
tiktoken 0.7.0
timm 1.0.15
tinycss2 1.3.0
tokenizers 0.21.0
toml 0.10.2
tomli 2.0.1
toolz 0.12.1
torch 2.7.0
torch-tensorrt 2.5.0a0
torchaudio 2.2.0
torchdiffeq 0.2.5
torchmetrics 1.7.1
torchprofile 0.0.4
torchsde 0.2.6
torchvision 0.22.0
torchx 0.7.0
tornado 6.4
tqdm 4.66.4
tqdm-multiprocess 0.0.11
traitlets 5.9.0
trampoline 0.1.2
transformer-engine 1.9.0+43d0d17
transformers 4.51.3
treelite 4.1.2
trimesh 4.6.9
triton 3.3.0
tritonclient 2.51.0
typeguard 4.4.2
typepy 1.3.4
typer 0.12.3
types-dataclasses 0.6.6
typing_extensions 4.13.2
typing-inspect 0.9.0
typing-inspection 0.4.0
tzdata 2024.1
tzlocal 5.2
ucx-py 0.37.0
unstructured 0.14.9
unstructured-client 0.35.0
urllib3 1.26.20
uvicorn 0.34.2
virtualenv 20.13.0+ds
virtualenv-clone 0.3.0
wandb 0.19.11
wasabi 1.1.3
wcwidth 0.2.13
weasel 0.4.1
webdataset 0.2.111
webencodings 0.5.1
websocket-client 1.8.0
Werkzeug 3.1.3
wget 3.2
wheel 0.43.0
word2number 1.1
wrapt 1.16.0
xdoctest 1.0.2
xgboost 2.0.3
xxhash 3.5.0
yarl 1.9.4
zarr 2.18.2
zict 3.0.0
zipp 3.19.0
zope.event 5.0
zope.interface 7.2
zstandard 0.23.0
```
Could you try the container without installing anything else inside?
NeMo and Megatron should already be on the PYTHONPATH at /opt/NeMo and /opt/megatron-lm.
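If it helps confirm which source trees actually get imported (the container's /opt installs vs. the editable checkouts), something like this should show the resolved locations, assuming `nemo` and `megatron.core` import cleanly:

```python
# Show which NeMo / Megatron-Core trees Python resolves,
# e.g. /opt/NeMo vs. an editable install under /home/jovyan/...
import megatron.core
import nemo

print(nemo.__file__)
print(megatron.core.__file__)
```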