Sean Smith

Results: 36 issues by Sean Smith

Hi, Using the following cluster template breaks the create interface after the Storage Tab with: From the Chrome console I see: ```bash framework-bb5c596eafb42b22.js:1 TypeError: Cannot read properties of undefined (reading...

bug

My map stopped displaying anything today after several years of usage. I’m assuming this means either the aviationweather.gov API is down or they’ve changed the API format. I confirmed and...

I'm getting a couple of errors. Any advice on how to fix this is appreciated! ``` with-readline.c:59:11: error: use of undeclared identifier 'rl_gnu_readline_p' rl_gnu_readline_p ? "GNU" : "non-GNU", with-readline.c:379:5: error:...

Not sure if there's a setup step that I'm missing here but when I run the included Windows or Linux DCV job I get: ``` sbatch failed (parameters: -J Linux_Desktop...

*Issue #, if available:* *Description of changes:* By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

This script adds the libfabric, NVIDIA driver, and CUDA versions. This covers everything in `efa-versions.sh`, so I removed that script. ``` $ srun python3 efa-versions.py +--------------------------+--------------+ | Package | Version...

https://github.com/aws-samples/awsome-distributed-training/blob/8214ab770e0c882e711c393f0840fda7bc06597d/2.ami_and_containers/1.amazon_machine_image/packer-ami.pkr.hcl#L26 See https://docs.aws.amazon.com/parallelcluster/latest/ug/document_history.html

If you see the following error when installing [install_slurm_exporter.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/install_slurm_exporter.sh) ``` E: Failed to fetch https://proget.makedeb.org/debian-feeds/prebuilt-mpr/focal/golang-go/amd64/golang-go_amd64_2:1.22.2-1.deb File has unexpected size (71708458 != 71736104). Mirror sync in progress? [IP: 104.237.134.92 443] ```...

stale
Troubleshooting Tips

The NCCL tests for Kubernetes limit the container to 8GB, which causes an OOM issue when they run. https://github.com/aws-samples/awsome-distributed-training/blob/a99d6cd0f48abeecfa7d5a7710af4eb0a7079752/micro-benchmarks/nccl-tests/kubernetes/nccl-tests.yaml#L91 This results in an issue that looks like:...

Add a template for P5s here: https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/4.amazon-eks

stale