Sean Smith
If you've installed aws-ofi-nccl from *conda* and have a system with version of libfabric `
If you experience an NCCL slowdown, the first step is to enable debug logging: ```bash export NCCL_DEBUG=INFO ``` This will allow you to catch any misconfigurations in the logs, for example if...
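As a minimal sketch of how the debug output might be collected and scanned, the snippet below also sets `NCCL_DEBUG_FILE` so each process writes to its own log. The log path and the `nccl_warnings` helper are assumptions made for illustration and are not part of the original post.

```bash
# Sketch: turn on NCCL debug logging and write it to per-process files.
export NCCL_DEBUG=INFO
# %h expands to the hostname and %p to the process ID.
# The /tmp location is an assumption; pick any writable path.
export NCCL_DEBUG_FILE="/tmp/nccl-debug.%h.%p.log"

# Hypothetical helper: print only the warning lines from one or more logs.
nccl_warnings() {
    grep -h "NCCL WARN" "$@"
}
```

After a run, something like `nccl_warnings /tmp/nccl-debug.*.log` would list the warnings across all ranks.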
If you see the following issue in your code after setting `FI_LOG_LEVEL=info`: ``` libfabric:652244:1713524816::core:core:cuda_set_sync_memops():207 Failed to perform cuPointerSetAttribute: CUDA_ERROR_NOT_SUPPORTED:operation not supported libfabric:652244:1713524816::efa:mr:efa_mr_hmem_setup():254 unable to set memops for cuda ptr libfabric:652244:1713524816::efa:mr:efa_mr_regattr():1014...
If you're trying to connect to your SageMaker HyperPod cluster and you see the error "An error occurred (TargetNotConnected)", there are a couple of common causes: ``` An error occurred (TargetNotConnected)...
The `provisioning_parameters.json` file needs to be valid JSON or cluster creation will fail. For example, the following JSON is missing the `partition_name` value: ```json { "version": "1.0.0", "workload_manager": "slurm", "controller_group":...
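One way to catch this before creating the cluster is to parse the file locally. The sketch below assumes `python3` is on the `PATH`; `validate_json` is a hypothetical helper name. It only checks that the file parses as JSON and does not check that required keys are present.

```bash
# Sketch: check that a file parses as JSON before using it.
# Assumes python3 is installed; validate_json is a hypothetical helper.
validate_json() {
    local file="$1"
    if python3 -m json.tool "$file" > /dev/null 2>&1; then
        echo "valid"
    else
        echo "invalid"
        return 1
    fi
}
```

Usage would look like `validate_json provisioning_parameters.json`, which prints `valid` or `invalid` and returns a non-zero status on failure.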
We should add the following snippet to all Slurm examples so that the `--auto-resume=1` flag is added automatically when the job runs on a HyperPod cluster. This needs to be tested for all...
https://github.com/aws-samples/awsome-distributed-training/blob/d66304ff17229dd857397d725ed9e168bc41167f/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/install_enroot_pyxis.sh#L47
This adds a script `easy-ssh.sh` that makes it easy to connect to AWS ParallelCluster using SSM. ```bash $ ./easy-ssh.sh [cluster-name] ``` Which will output: ``` Instance Id: i-0096542c11ccb02b5 Os: ubuntu2004...
If you see the following error when building a Dockerfile: ``` sh: 1: Bad substitution ``` It's likely caused by your Dockerfile running `sh` rather than `bash`; `sh` doesn't support...
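To illustrate the difference, here is a minimal sketch that assumes the failing command uses a bash-only parameter expansion (the `version` variable is made up for the example). Running the command explicitly under `bash` avoids the error; in a Dockerfile the same effect can be had by wrapping the command in `bash -c '...'` or by switching the build shell with the `SHELL` instruction.

```bash
# Sketch: a bash-only expansion run explicitly under bash.
# ${version//./-} replaces every "." with "-". A strict POSIX sh
# such as dash rejects this syntax with "Bad substitution".
bash -c 'version="1.2.3"; echo "${version//./-}"'
```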
Hi, I'm getting the following error when setting up a new computer: ``` ERROR render of "taxonomy" failed: "/Users/seaam/projects/swsmith.cc/themes/archie/layouts/_default/baseof.html:8:12": execute of template failed: template: _default/terms.html:8:12: executing "_default/terms.html" at : error...