Verdi March issues

Results: 24 issues of Verdi March

nccl-tests container: fix cuda driver mismatch

*Issue #, if available:* nccl-test with container image fails with `system has unsupported display driver / cuda driver combination`. *Description of changes:* - update cuda compat to fix error: ```text...

Extra containerized nccl tests

*Issue #, if available:* N/A *Description of changes:* sample .sbatch scripts to run nccl tests under containers. Two variants: native implementation, and a pure pytorch-based (that some of our customers...

Bump pytorch dockerfile template

*Issue #, if available:* N/A *Description of changes:* update PyTorch dockerfile template to 24.02. By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this...

smhp: quality-of-life improvements

*Issue #, if available:* N/A *Description of changes:* Upstream from the playground repo. These are off by default; edit config.py to switch them on. - ssh to compute nodes from...

Script to probe the nccl libraries that PyTorch uses

*Issue #, if available:* close #252 *Description of changes:* Probe what PyTorch actually uses for the nccl stacks. By submitting this pull request, I confirm that you can use, modify,...

Slurm job template: how a job can probe instance topology and hostname-instanceid mappings…

*Issue #, if available:* N/A *Description of changes:* a sample template for writing a Slurm job that probes EC2 information, so that job logs contain as much info as possible for...

SMHP: slurm exporter to report gpu metrics

*Issue #, if available:* N/A *Description of changes:* Prometheus Slurm exporter to report GPU metrics (total, allocated). By submitting this pull request, I confirm that you can use, modify, copy,...

megatron-lm test case: update README

*Issue #, if available:* N/A *Description of changes:* update README with tips and tricks. By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this...

Prepare DLAMI for ParallelCluster using pcluster build-image

*Issue #, if available:* N/A *Description of changes:* Example to prepare DLAMI using `pcluster build-image` which does not require additional community tools (`ansible` and `packer`). By submitting this pull request,...

enhancement

Example 10.FSDP reports 35b model created instead of 70b

The README recommends these hyperparameters to train a 70b model: ```text --num_key_value_heads=8 --llama_intermediate_size=28672 --hidden_width=8192 --num_layers=80 --num_heads=64 ``` but the train script reports that it creates a 35B model instead: ```text 0:...

stale
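
As a rough sanity check on the hyperparameters quoted in the Example 10.FSDP issue above, the sketch below estimates how many parameters they imply. It is not taken from the issue or the train script. It assumes a Llama-2-style decoder (no bias terms, grouped-query attention, a gated MLP with three projections, two RMSNorm weights per layer, untied input and output embeddings) and a vocabulary size of 32000, which the issue does not state. The function name `llama_param_count` is hypothetical.

```python
def llama_param_count(hidden_width, num_layers, num_heads, num_key_value_heads,
                      intermediate_size, vocab_size=32000, tied_embeddings=False):
    """Estimate the parameter count of a Llama-2-style decoder (assumed layout)."""
    head_dim = hidden_width // num_heads
    kv_width = head_dim * num_key_value_heads  # grouped-query K/V projection width

    # Attention: Q and O are hidden x hidden; K and V are hidden x kv_width.
    attn = 2 * hidden_width * hidden_width + 2 * hidden_width * kv_width
    # Gated MLP: gate, up and down projections.
    mlp = 3 * hidden_width * intermediate_size
    # Two RMSNorm weight vectors per layer.
    norms = 2 * hidden_width
    per_layer = attn + mlp + norms

    embed = vocab_size * hidden_width
    lm_head = 0 if tied_embeddings else vocab_size * hidden_width
    final_norm = hidden_width
    return num_layers * per_layer + embed + lm_head + final_norm


# Values from the issue; vocab_size is an assumption, not from the issue.
total = llama_param_count(
    hidden_width=8192,
    num_layers=80,
    num_heads=64,
    num_key_value_heads=8,
    intermediate_size=28672,
)
print(f"{total / 1e9:.2f}B parameters")
```

Under these assumptions the estimate comes out at roughly 69 billion parameters, so the listed hyperparameters are consistent with a 70b-class model. The sketch says nothing about why the train script reports 35B; that would need the script's own counting logic, which is not shown here.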