Verdi March

Results 24 issues of Verdi March

*Issue #, if available:* nccl-test with container image fails with `system has unsupported display driver / cuda driver combination`. *Description of changes:* - update cuda compat to fix error: ```text...

*Issue #, if available:* N/A *Description of changes:* Sample .sbatch scripts to run NCCL tests under containers. Two variants: a native implementation, and a pure PyTorch-based one (that some of our customers...

*Issue #, if available:* N/A *Description of changes:* update PyTorch dockerfile template to 24.02. By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this...

*Issue #, if available:* N/A *Description of changes:* Upstream from the playground repo. These are off by default. Edit the config.py to switch on. - ssh to compute nodes from...

*Issue #, if available:* close #252 *Description of changes:* Probe what PyTorch actually uses for the nccl stacks.

*Issue #, if available:* N/A *Description of changes:* A sample template for writing a Slurm job that probes EC2 information, so that job logs contain as much info as possible for...

*Issue #, if available:* N/A *Description of changes:* Prometheus Slurm exporter to report GPU metrics (total, allocated).

*Issue #, if available:* N/A *Description of changes:* update README with tips and tricks.

*Issue #, if available:* N/A *Description of changes:* Example to prepare DLAMI using `pcluster build-image` which does not require additional community tools (`ansible` and `packer`).

enhancement

The README recommends these hyperparameters to train a 70B model: ```text --num_key_value_heads=8 --llama_intermediate_size=28672 --hidden_width=8192 --num_layers=80 --num_heads=64 ``` but the train script reports that it creates a 35B model instead: ```text 0:...

stale
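
As a rough cross-check of the hyperparameters quoted in the issue above, the parameter count of a Llama-style decoder can be estimated by hand. The sketch below is not taken from the train script. It assumes a vocabulary of 32000 tokens, untied input and output embeddings, no bias terms, a gated MLP with three weight matrices, and two RMSNorm weight vectors per layer. The function name and argument names are hypothetical; they only mirror the flags in the issue.

```python
def llama_param_count(hidden_width, num_layers, num_heads, num_key_value_heads,
                      intermediate_size, vocab_size=32000, tied_embeddings=False):
    """Estimate the parameter count of a Llama-style decoder (assumptions above)."""
    head_dim = hidden_width // num_heads
    kv_width = num_key_value_heads * head_dim  # grouped-query attention K/V width

    # Attention: Q and O projections are hidden x hidden, K and V are hidden x kv_width.
    attention = 2 * hidden_width * hidden_width + 2 * hidden_width * kv_width
    # Gated MLP: gate, up, and down projections.
    mlp = 3 * hidden_width * intermediate_size
    # Two RMSNorm weight vectors per layer (assumed, no biases).
    norms = 2 * hidden_width

    per_layer = attention + mlp + norms
    embeddings = vocab_size * hidden_width
    lm_head = 0 if tied_embeddings else vocab_size * hidden_width
    final_norm = hidden_width

    return num_layers * per_layer + embeddings + lm_head + final_norm


# Values from the flags quoted in the issue; vocab_size is an assumption.
total = llama_param_count(
    hidden_width=8192,
    num_layers=80,
    num_heads=64,
    num_key_value_heads=8,
    intermediate_size=28672,
)
print(f"{total / 1e9:.2f}B parameters")
```

Under these assumptions the estimate lands close to 69B, which is consistent with the README's "70B" label rather than the 35B figure in the log. It does not explain where the script's number comes from; it only shows that the flags themselves describe a model of roughly 70B parameters.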