deepops

Tools for building GPU clusters

Results 11 deepops issues

I found that this repo has a role for nis_client, but it does not seem to be ready to run in any playbook. Am I missing something or...

Currently our playbook for installing the DGX stack supports RHEL 7, but not RHEL 8. RHEL 8 is supported in the [manual procedure](https://docs.nvidia.com/dgx/dgx-rhel8-install-guide/index.html) for setting up a DGX, so we...

no-stale

I installed DeepOps v23.08 last year, using K8s and the default settings. But now I need to upgrade the NVIDIA Driver so I can use a newer version of CUDA...

At the moment the default Slurm KillWait of 30 seconds is used, but this may be insufficient for large parallel jobs to terminate gracefully. As a reminder, it defines the period in seconds between try...

Under some circumstances the Slurm epilog fails to clean up processes because of how it parses the output of nvidia-smi pmon. From /var/log/slurm/prolog-epilog: + for i in $(nvidia-smi pmon -c 1 | tail -n+3 | awk...
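To make the parsing concern concrete, here is a minimal sketch of one way to pull PIDs out of `nvidia-smi pmon -c 1` output without counting header lines. It is not the code from the DeepOps epilog: the function name `pmon_pids` is hypothetical, and the sketch assumes that header lines start with `#` and that the PID is the second whitespace-separated column, with `-` shown when a GPU has no process.

```shell
#!/bin/sh
# Hypothetical helper (not part of DeepOps): read `nvidia-smi pmon -c 1`
# output on stdin and print one PID per line.
# Assumptions: header lines begin with '#', and column 2 holds the PID,
# or '-' when no process is running on that GPU.
pmon_pids() {
    # Skip header lines by their '#' prefix instead of by line count,
    # and keep only rows whose second field is purely numeric.
    awk '!/^#/ && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Intended use on a GPU node (commented out so the sketch runs anywhere):
# nvidia-smi pmon -c 1 | pmon_pids
```

Filtering on the `#` prefix rather than on a fixed number of header lines means the result does not depend on how many header lines a given driver version prints.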

Possible solution for #1317, which I opened. Update the KillWait parameter in slurm.conf from 30 to 120 seconds to allow for more graceful job termination. This change ensures that jobs have...
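For reference, the setting described above corresponds to a single line in `slurm.conf`. This is only a sketch of the resulting configuration value; how DeepOps templates or deploys the file is not shown in the snippet and is not assumed here.

```
# slurm.conf fragment: seconds allowed between SIGTERM and SIGKILL
# when a job reaches its time limit (Slurm's default is 30)
KillWait=120
```

On a running cluster, `scontrol show config | grep KillWait` reports the value currently in effect.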

Simple resolution for issue #1315, which I opened earlier. Remove the redundant 'tail' command in the GPU process cleanup checks to ensure more accurate detection and termination of residual GPU processes. This change...

Ubuntu 22.04.4 LTS (Jammy Jellyfish) ansible [core 2.11.12] config file = /as/deepops/ansible.cfg configured module search path = ['/as/deepops/submodules/kubespray/library'] ansible python module location = /opt/deepops/env/lib/python3.10/site-packages/ansible ansible collection location = /as/deepops/collections executable...

ansible-galaxy is failing to download collections with error: ` Updating Ansible Galaxy roles... [WARNING]: Skipping Galaxy server https://galaxy.ansible.com/api/. Got an unexpected error when getting available versions of collection community.general: '/api/v3/plugin/ansible/content/published/collections/index/...

**Issue:** The Ansible playbook fails to add a RHEL 8 DGX node to the K8s cluster; the kubelet service crashes. **Issue Details:** One of the DGX RHEL7 nodes failed in the cluster...