oci-hpc
Terraform examples for deploying HPC clusters on OCI
When Terraform provisioning fails because the cluster network is only partially provisioned, Terraform doesn't register it, so it is not removed during deletion. Adds double delete for...
Hello, the installation of oracle-cloud-agent fails because the bucket URL is not publicly accessible.
```
yum:
  name: "https://objectstorage.us-phoenix-1.oraclecloud.com/p/aV_mSl96KIiapAeZtsyo-SUcPCSurDfWaj06f4XVVoNKIsxvqlZ65guPTnMuNawR/n/imagegen/b/agent_test/o/1.37.0/3/oracle-cloud-agent-1.37.2-10459.el8.x86_64.rpm"
  state: present
  disable_gpg_check: yes
```
```
{"code":"BucketNotFound","message":"Either the bucket named 'agent_test'...
```
In the HPC field, customers sometimes use Ubuntu for compute nodes. I tried to test Ubuntu as a compute node and found that the current version doesn't support it. Do...
Not sure if this can be resolved, but I wonder if it would be possible to check if the nodes are available before provisioning the cluster network rather than provisioning...
I have an array job limited to 2 jobs at a time:
```
2145_[4-190%2] compute EP_108 opc PD 0:00 10 (JobArrayTaskLimit)
2145_3 compute EP_108 opc R 48:42 10 compute-hpc-node-[100,373,397,421,425,429,455,457,813,896]
2145_2...
```
By default, it seems that the only PMI installed in the cluster is MPICH. It would be helpful to also add an Intel-compatible PMI. I assume installing slurm-libpmi will help....
See [source](https://bugs.schedmd.com/show_bug.cgi?id=3941#c7), where they explain the default UnkillableStepTimeout=60, which doesn't fit with the plugin.
Fix typo
To prevent accidental breaks because provider versions are not pinned.
If the monitoring variable is false, the cron job is scheduled to run an empty command very often. This change comments out the entire line, as is done with...
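For illustration only, here is one way a cron entry can be commented out when monitoring is off, using the `disabled` option of Ansible's `ansible.builtin.cron` module. This is a sketch, not necessarily how the repository does it: the variable name `monitoring`, the job name, the schedule, and the script path are all hypothetical.

```yaml
# Hypothetical task: variable name, job name, schedule and script path are assumptions.
- name: Schedule the monitoring job, commented out when monitoring is off
  ansible.builtin.cron:
    name: "cluster monitoring"                    # hypothetical job label
    minute: "*/5"                                 # hypothetical schedule
    job: "/opt/example/monitor.sh"                # hypothetical script path
    state: present                                # disabled only takes effect with state=present
    disabled: "{{ not (monitoring | bool) }}"     # writes the crontab line commented out
```

With `disabled` set to true, the module keeps the entry in the crontab but prefixes it with `#`, so no empty command is ever scheduled.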