Not iterating through all GPUs in the system (3 of them: GTX 1070, 750 Ti, 950) — stuck at the 1st one in the list and hangs forever.
/usr/bin/coolgpus --speed 60 60
No existing X servers, we're good to go
Starting xserver: Xorg :0 -once -config /tmp/cool-gpu-00000000:01:00.0rr5gi2u3/xorg.conf
Starting xserver: Xorg :1 -once -config /tmp/cool-gpu-00000000:05:00.0f1ewmqu3/xorg.conf
Starting xserver: Xorg :2 -once -config /tmp/cool-gpu-00000000:09:00.0v06tfhl_/xorg.conf
X.Org X Server 1.19.5
Release Date: 2017-10-12
X Protocol Version 11, Revision 0
Build Operating System: 3.10.0-693.17.1.el7.x86_64
Current Operating System: Linux nvidia-2 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64
Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.el7.x86_64 root=UUID=d1853314-ef45-4eaf-a4d5-ca9205c2471f ro i915.modeset=1 i915.preliminary_hw_support=1 rhgb quiet
Build Date: 29 October 2018 03:33:19PM
Build ID: xorg-x11-server 1.19.5-5.1.el7_5.0.1
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Tue Jul 21 00:55:38 2020
(++) Using config file: "/tmp/cool-gpu-00000000:01:00.0rr5gi2u3/xorg.conf"
(==) Using config directory: "/etc/X11/xorg.conf.d"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
X.Org X Server 1.19.5
Release Date: 2017-10-12
X Protocol Version 11, Revision 0
Build Operating System: 3.10.0-693.17.1.el7.x86_64
Current Operating System: Linux nvidia-2 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64
Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.el7.x86_64 root=UUID=d1853314-ef45-4eaf-a4d5-ca9205c2471f ro i915.modeset=1 i915.preliminary_hw_support=1 rhgb quiet
Build Date: 29 October 2018 03:33:19PM
Build ID: xorg-x11-server 1.19.5-5.1.el7_5.0.1
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.1.log", Time: Tue Jul 21 00:55:38 2020
(++) Using config file: "/tmp/cool-gpu-00000000:05:00.0f1ewmqu3/xorg.conf"
(==) Using config directory: "/etc/X11/xorg.conf.d"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
X.Org X Server 1.19.5
Release Date: 2017-10-12
X Protocol Version 11, Revision 0
Build Operating System: 3.10.0-693.17.1.el7.x86_64
Current Operating System: Linux nvidia-2 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64
Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.el7.x86_64 root=UUID=d1853314-ef45-4eaf-a4d5-ca9205c2471f ro i915.modeset=1 i915.preliminary_hw_support=1 rhgb quiet
Build Date: 29 October 2018 03:33:19PM
Build ID: xorg-x11-server 1.19.5-5.1.el7_5.0.1
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.2.log", Time: Tue Jul 21 00:55:38 2020
(++) Using config file: "/tmp/cool-gpu-00000000:09:00.0v06tfhl_/xorg.conf"
(==) Using config directory: "/etc/X11/xorg.conf.d"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
GPU :0, 58C -> [60%-60%]. Setting speed to 60%
(hangs here)
^C
Released fan speed control for GPU at :0
(hangs again)
nvidia-smi
Tue Jul 21 01:12:22 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57 Driver Version: 450.57 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 Off | 00000000:01:00.0 Off | N/A |
| 60% 53C P2 69W / 151W | 2997MiB / 8119MiB | 23% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 750 Ti Off | 00000000:05:00.0 Off | N/A |
| 48% 56C P0 30W / 38W | 1229MiB / 2002MiB | 83% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 950 Off | 00000000:09:00.0 Off | N/A |
| 32% 71C P0 45W / 75W | 974MiB / 2002MiB | 51% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
There's some troubleshooting advice on the main page; mind stepping through it?
If you're uncomfortable with pdb, looking at where things are hanging I think a good alternative would be to add print(command) above this line and see what command is making things hang. Then try and run that command manually.
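For concreteness, here's a hedged sketch of what that print-debugging looks like; the log_output helper below is paraphrased from coolgpus and the exact body may differ in your copy:

```python
import subprocess

def log_output(command, timeout=60):
    # Print the command list before launching it: if the script hangs,
    # the last line printed names the command that never returned.
    print(command)
    p = subprocess.Popen(command, stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT)
    p.wait(timeout)
    return p.stdout.read().decode()
```

Once you know which command hangs, run that exact command in a terminal by hand.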
There's some troubleshooting advice on the main page; mind stepping through it?
If you're uncomfortable with pdb, looking at where things are hanging I think a good alternative would be to add print(command) above this line and see what command is making things hang. Then try and run that command manually.
I didn't understand what pdb was supposed to do; I saw no changes.
I added the print:
(==) Log file: "/var/log/Xorg.1.log", Time: Tue Jul 21 01:40:49 2020
(++) Using config file: "/tmp/cool-gpu-00000000:05:00.0prjzqd_m/xorg.conf"
(==) Using config directory: "/etc/X11/xorg.conf.d"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
['nvidia-settings', '-a', '[gpu:0]/GPUFanControlState=1', '-c', ':0']
['nvidia-settings', '-a', '[fan:0]/GPUTargetFanSpeed=60', '-c', ':0']
GPU :0, 60C -> [60%-60%]. Setting speed to 60%
['nvidia-smi', '--format=csv,noheader', '--query-gpu=temperature.gpu', '-i', '00000000:05:00.0']
['nvidia-settings', '-a', '[gpu:0]/GPUFanControlState=1', '-c', ':1']
['nvidia-settings', '-a', '[fan:0]/GPUTargetFanSpeed=60', '-c', ':1']
this is it
Try running
nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :1
in a terminal.
Adding pdb.set_trace() somewhere should drop you into an interactive debugger prompt when you run the program.
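A minimal sketch of how that looks (build_fan_command and the debug flag are illustrative names, not coolgpus's actual code):

```python
import pdb

def build_fan_command(display, speed):
    # Roughly the nvidia-settings invocation the script prints.
    return ['nvidia-settings', '-a',
            '[fan:0]/GPUTargetFanSpeed=%d' % speed, '-c', display]

def set_speed(display, speed, debug=False):
    command = build_fan_command(display, speed)
    if debug:
        # Drops you at a (Pdb) prompt: `p command` to inspect, `c` to continue.
        pdb.set_trace()
    print(command)  # stand-in for actually running the command
```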
nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :1
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system
The first 2 GPUs have no physical displays connected; the 3rd one does. However, even without a physical display the problem is the same.
I'd like to mention that in some random attempts I was able to set all the fans. I don't even remember what I was doing: I rebooted a couple of times, reinstalled the driver, replugged the physical display, and the script passed a few times. But after closing it and running it again, it hangs.
I don't get this...
/usr/bin/coolgpus --speed 60 60
> /usr/bin/coolgpus(11)<module>()
-> parser = argparse.ArgumentParser(description=r'''
(Pdb)
(Pdb)
(Pdb)
(Pdb) help
(Pdb) run
Traceback (most recent call last):
File "/usr/bin/coolgpus", line 11, in <module>
parser = argparse.ArgumentParser(description=r'''
File "/usr/bin/coolgpus", line 11, in <module>
parser = argparse.ArgumentParser(description=r'''
File "/usr/lib64/python3.6/bdb.py", line 51, in trace_dispatch
return self.dispatch_line(frame)
File "/usr/lib64/python3.6/bdb.py", line 69, in dispatch_line
self.user_line(frame)
File "/usr/lib64/python3.6/pdb.py", line 261, in user_line
self.interaction(frame, None)
File "/usr/lib64/python3.6/pdb.py", line 352, in interaction
self._cmdloop()
File "/usr/lib64/python3.6/pdb.py", line 321, in _cmdloop
self.cmdloop()
File "/usr/lib64/python3.6/cmd.py", line 138, in cmdloop
stop = self.onecmd(line)
File "/usr/lib64/python3.6/pdb.py", line 418, in onecmd
return cmd.Cmd.onecmd(self, line)
File "/usr/lib64/python3.6/cmd.py", line 217, in onecmd
return func(arg)
File "/usr/lib64/python3.6/pdb.py", line 1028, in do_run
raise Restart
pdb.Restart
coolgpus will not work on any system with a display or any system that's expecting a display. You'll need to remove the display, restart, SSH in, and toy around until
nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :1
works for -c :0, -c :1, -c :2. Googling the error message has some possible directions. But this is, I'm afraid, much more about debugging your system setup than it is about debugging the script.
Take a look at the pdb docs. No longer useful for this problem, but overall it's one of the most useful tools in Python programming. Especially the pdb.pm() bit.
I did not have a display connected initially, it makes no difference. I've connected it at the last attempt to see what changes. Well, nothing.
nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :1
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system
[root@nvidia-2 ~]# nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :0
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system
[root@nvidia-2 ~]# nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :2
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system
[root@nvidia-2 ~]# nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :3
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system
[root@nvidia-2 ~]# nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :4
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system
The system is a plain CentOS 7 with the NVIDIA driver for headless transcoding; there is nothing custom on it to debug.
"Unable to find display on any available system" — why would it find a display if I removed the physical one, as you suggested? That doesn't make any sense.
Not to discourage you too much but: it feels like you're hoping that I have more knowledge about this than I do. Your system absolutely has something to debug, as you can tell by the way a thing you want to do isn't working as you'd expect it to.
If you want to push forward with this, a general loop should be:
- Take anything you know about the problem you're seeing (i.e., ERROR: Unable to find display on any available system) and Google until you find people with similar problems.
- Try out their fixes.
- If their fixes don't work, think about how your case differs from their case, or what you can do to more accurately isolate the problem you're seeing.
- Go back to doing more Googling.
It's hard! This might take hours or days! You might have to learn huge amounts about subjects that are totally irrelevant, just to check one possible fix! It probably won't be worth it! But, frankly: the only other choice is to give up and decide you don't care that much about coolgpu's functionality.
"Unable to find display on any available system" is literally what it says: no displays attached, either physical or virtual. Doesn't your script create virtual displays to set fan speeds? Apparently it does, because I'm able to run "nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :2" while your script is open in another SSH window, but no output comes. So I don't understand what exactly about the "Unable to find display on any available system" message has to be fixed, or why. It is expected and not related to the issue described.
OK - in that case, use pdb, use print statements - figure out where it is the script is actually hanging, then make sure you can replicate it yourself, then Google around to figure out what's causing that hanging.
You may need to replicate the xserver setup the script is doing to replicate the hanging. You might want to tear the xserver setup bit of coolgpus out into your own script, then run that and leave it running in the background while you experiment. There're lots of ways forward, just requires a bit of ingenuity!
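A sketch of that isolation step, assuming you reuse the xorg.conf files coolgpus writes under /tmp (the helper names and config path here are illustrative, not part of the script):

```python
import subprocess

def xserver_cmd(display, config):
    # Same shape as the "Starting xserver: Xorg :1 -once -config ..." line
    # that coolgpus logs.
    return ['Xorg', display, '-once', '-config', config]

def fan_cmd(display, speed=60):
    return ['nvidia-settings', '-a',
            '[fan:0]/GPUTargetFanSpeed=%d' % speed, '-c', display]

def replicate(display, config, speed=60):
    # Leave the X server running in the background, then poke it by hand;
    # if this hangs the same way, you've reproduced the problem outside
    # the script and can experiment with the command freely.
    server = subprocess.Popen(xserver_cmd(display, config))
    try:
        subprocess.run(fan_cmd(display, speed), timeout=60)
    finally:
        server.terminate()
```

Call replicate(':1', ...) with the config path your own coolgpus run printed.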
I already added the print and posted it earlier; it hangs on:
['nvidia-settings', '-a', '[gpu:0]/GPUFanControlState=1', '-c', ':0']
['nvidia-settings', '-a', '[fan:0]/GPUTargetFanSpeed=60', '-c', ':0']
GPU :0, 60C -> [60%-60%]. Setting speed to 60%
['nvidia-smi', '--format=csv,noheader', '--query-gpu=temperature.gpu', '-i', '00000000:05:00.0']
['nvidia-settings', '-a', '[gpu:0]/GPUFanControlState=1', '-c', ':1']
^C
Google is useless here; I've spent 5 hours today before writing here.
I always have to kill Xorg like this, because it never exits: killall Xorg -9
Reinstalling xorg server does not help.
Ok, let's make it easier. I need to adjust only last GPU in list, how to select only one GPU with this script and skip others?
Right! That's the spirit.
The answer is: there's no built-in method. Try cloning this repo and editing the script yourself; add a conditional somewhere to only look at specific GPUs.
More generally, you know where the script hangs but you haven't isolated the aberrant behaviour. You want to be able to enter a series of commands into the terminal and get the same hang. Then you can experiment freely with that series of commands, try different versions, add --verbose flags, etc etc etc.
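One way that conditional might look, assuming you patch the gpu_buses helper the script uses to enumerate devices (WANTED and filter_buses are my names, not part of coolgpus):

```python
import subprocess

WANTED = {'00000000:09:00.0'}  # bus IDs you want managed; the rest are skipped

def filter_buses(buses, wanted=WANTED):
    # Keep only the GPUs whose PCI bus ID is in the wanted set.
    return [b for b in buses if b.strip() in wanted]

def gpu_buses():
    out = subprocess.check_output(
        ['nvidia-smi', '--format=csv,noheader', '--query-gpu=pci.bus_id'])
    return filter_buses(out.decode().splitlines())
```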
I think I get the same error when running nvidia-settings from terminal
$ nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :0
Unable to init server: Could not connect: Connection refused
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system
At the same time, I am able to run coolgpus from a conda environment. @Neolo can you try to run the script from this environment
name: coolgpus
channels:
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- ca-certificates=2020.1.1=0
- certifi=2019.11.28=py38_0
- ld_impl_linux-64=2.33.1=h53a641e_7
- libedit=3.1.20181209=hc058e9b_0
- libffi=3.2.1=hd88cf55_4
- libgcc-ng=9.1.0=hdf63c60_0
- libstdcxx-ng=9.1.0=hdf63c60_0
- ncurses=6.2=he6710b0_0
- openssl=1.1.1d=h7b6447c_4
- pip=20.0.2=py38_1
- python=3.8.1=h0371630_1
- readline=7.0=h7b6447c_5
- setuptools=45.2.0=py38_0
- sqlite=3.31.1=h7b6447c_0
- tk=8.6.8=hbc83047_0
- wheel=0.34.2=py38_0
- xz=5.2.4=h14c3975_4
- zlib=1.2.11=h7b6447c_3
- pip:
- coolgpus==0.17
Yep, I think @Neolo is right about the ERROR being a symptom of the missing xserver env. Still expect that command to be the source of the hang since it was the last command printed, just gonna need more work to get a manual reproduction.
I'll be surprised if the env is causing the hang, but it's a good idea since it's an easy thing to check.
So weird. I just removed contents of /etc/X11/ and ran
nvidia-xconfig --allow-empty-initial-configuration --enable-all-gpus --cool-bits=28 --separate-x-screens --enable-all-gpus --use-display-device=none
Using X configuration file: "/etc/X11/xorg.conf".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen0".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen0 (1)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen0 (2)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen0 (3)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen1".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen1 (1)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen1 (2)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen1 (3)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen2".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen2 (1)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen2 (2)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen2 (3)".
Backed up file '/etc/X11/xorg.conf' as '/etc/X11/xorg.conf.backup'
New X configuration file written to '/etc/X11/xorg.conf'
and right after, for one time only, I was able to pass fan speeds to the 1st AND 2nd GPUs only; the 3rd hanged. Closed the script, ran it again — same problem, only the 1st GPU now, again...
At the same time, I am able to run coolgpus from a conda environment. @Neolo can you try to run the script from this environment
I'm not into Python; tell me what to run, I didn't get it.
- https://conda.io/projects/conda/en/latest/user-guide/install/linux.html (please follow the instructions for Miniconda)
- Verify installation (conda --help)
- Save the environment I shared with you into env.yml
- conda env create -f env.yml -- it should install the environment to your machine
- conda activate env -- activating the virtual env
- Run some examples from the README.md to see if it hangs in the same way.
Will try that environment tomorrow.
The answer is: there's no built in method. Try cloning this repo and editing the script yourself; add a conditional somewhere to only look at specific GPUs.
For now, I just made a dirty trick to select the last GPU in the list, which is "burning" right now at 72 C.
def gpu_buses():
    # return log_output(['nvidia-smi', '--format=csv,noheader', '--query-gpu=pci.bus_id']).splitlines()
    return '00000000:09:00.0'.splitlines()
and it sets the speed fine, no hangs.
I'm not into python, tell me what to run, I didn't get it.
- https://conda.io/projects/conda/en/latest/user-guide/install/linux.html (please follow the instructions for Miniconda)
- Verify installation (conda --help)
- Save the environment I shared with you into env.yml
- conda env create -f env.yml -- it should install the environment to your machine
- conda activate env -- activating the virtual env
- Run some examples from the README.md to see if it hangs in the same way.
Installed Miniconda, activated env, running "$(which coolgpus) --temp 60 60" just doesn't do anything, not even setting the first GPU at all.
[root@nvidia-2 ~]# conda env create -f env.yml
[root@nvidia-2 ~]# conda activate coolgpus
(coolgpus) [root@nvidia-2 ~]# conda -V
conda 4.8.3
(coolgpus) [root@nvidia-2 ~]# $(which coolgpus) --temp 60 60
No existing X servers, we're good to go
Starting xserver: Xorg :0 -once -config /tmp/cool-gpu-00000000:01:00.0qa_grbj8/xorg.conf
Starting xserver: Xorg :1 -once -config /tmp/cool-gpu-00000000:05:00.089yczaej/xorg.conf
Starting xserver: Xorg :2 -once -config /tmp/cool-gpu-00000000:09:00.0khgo7uqs/xorg.conf
X.Org X Server 1.19.3
Release Date: 2017-03-15
X Protocol Version 11, Revision 0
Build Operating System: 3.10.0-514.16.1.el7.x86_64
Current Operating System: Linux nvidia-2 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64
Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.el7.x86_64 root=UUID=d1853314-ef45-4eaf-a4d5-ca9205c2471f ro i915.modeset=1 i915.preliminary_hw_support=1 rhgb quiet
Build Date: 05 August 2017 06:19:43AM
Build ID: xorg-x11-server 1.19.3-11.el7
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Wed Jul 22 19:06:50 2020
(++) Using config file: "/tmp/cool-gpu-00000000:01:00.0qa_grbj8/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
X.Org X Server 1.19.3
Release Date: 2017-03-15
X Protocol Version 11, Revision 0
Build Operating System: 3.10.0-514.16.1.el7.x86_64
Current Operating System: Linux nvidia-2 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64
Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.el7.x86_64 root=UUID=d1853314-ef45-4eaf-a4d5-ca9205c2471f ro i915.modeset=1 i915.preliminary_hw_support=1 rhgb quiet
Build Date: 05 August 2017 06:19:43AM
Build ID: xorg-x11-server 1.19.3-11.el7
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.2.log", Time: Wed Jul 22 19:06:50 2020
(++) Using config file: "/tmp/cool-gpu-00000000:09:00.0khgo7uqs/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
X.Org X Server 1.19.3
Release Date: 2017-03-15
X Protocol Version 11, Revision 0
Build Operating System: 3.10.0-514.16.1.el7.x86_64
Current Operating System: Linux nvidia-2 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64
Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.el7.x86_64 root=UUID=d1853314-ef45-4eaf-a4d5-ca9205c2471f ro i915.modeset=1 i915.preliminary_hw_support=1 rhgb quiet
Build Date: 05 August 2017 06:19:43AM
Build ID: xorg-x11-server 1.19.3-11.el7
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.1.log", Time: Wed Jul 22 19:06:50 2020
(++) Using config file: "/tmp/cool-gpu-00000000:05:00.089yczaej/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
Released fan speed control for GPU at :0
Released fan speed control for GPU at :1
^C
I am having the same issue and nobody can solve it. What I do is create a custom xorg file that works for me, and then in the NVIDIA app on Ubuntu, under PowerMizer, I can change the settings of each fan for each GPU. This is not useful when working via SSH on a headless box, but unfortunately there is no easy-to-use nvidia-settings solution anywhere.
So basically this is what worked for me
How to change GPU fan speeds in Ubuntu
1- In the applications, open NVIDIA X Server Settings
2- Select the GPU currently used for display output (should be the GPU in first PCIe slot)
3- Take note of the Bus ID
4- Run the following commands
sudo nvidia-xconfig --enable-all-gpus
sudo nvidia-xconfig --cool-bits=28
sudo reboot
5- After the computer reboots, plug the monitor into the last GPU
6- Open NVIDIA X-Server Settings again
7- Select the GPU currently used for display output
8- Take note of the Bus ID
9- Run sudo nano /etc/X11/xorg.conf. The GPUs will be listed in "Device" sections with formatting similar to this:
Section "Device"
    Identifier "name"
    Driver "driver"
    ...entries...
EndSection
10- Identify the GPUs with the Bus IDs that were previously noted
11- Swap the Bus IDs of the two GPUs
12- Press Ctrl+X to close “xorg.conf”
13- Press Y to save the file
14- Press “Enter” without changing the file name
15- Reboot
Fan speeds can now be changed from NVIDIA X Server Settings by selecting the Thermal Settings for each GPU and checking the option to "Enable GPU Fan Settings". Set the fan speed with the slider and click "Apply" to save it.
Newer version — it works even worse.
(==) Log file: "/var/log/Xorg.2.log", Time: Mon Jan 25 22:02:05 2021
(++) Using config file: "/tmp/cool-gpu-00000000:09:00.0ngck_2l3/xorg.conf"
(==) Using config directory: "/etc/X11/xorg.conf.d"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
GPU :0, 66C -> [60%-65%]. Setting speed to 60%
GPU :1, 36C -> [30%-30%]. Setting speed to 30%
Command timed out: nvidia-settings -a [gpu:0]/GPUFanControlState=1 -c :1
Released fan speed control for GPU at :0
Command timed out: nvidia-settings -a [gpu:0]/GPUFanControlState=0 -c :1
Terminating xserver for display :0
Terminating xserver for display :1
Terminating xserver for display :2
Traceback (most recent call last):
  File "/usr/bin/coolgpus", line 89, in log_output
    p.wait(60)
  File "/usr/lib64/python3.6/subprocess.py", line 1469, in wait
    raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['nvidia-settings', '-a', '[gpu:0]/GPUFanControlState=1', '-c', ':1']' timed out after 60 seconds
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/bin/coolgpus", line 239, in manage_fans
    set_speed(display, s)
  File "/usr/bin/coolgpus", line 224, in set_speed
    assign(display, '[gpu:0]/GPUFanControlState=1')
  File "/usr/bin/coolgpus", line 221, in assign
    log_output(['nvidia-settings', '-a', command, '-c', display])
  File "/usr/bin/coolgpus", line 102, in log_output
    raise ValueError('Command crashed with return code ' + str(p.returncode) + ': ' + ' '.join(command))
ValueError: Command crashed with return code None: nvidia-settings -a [gpu:0]/GPUFanControlState=1 -c :1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/bin/coolgpus", line 89, in log_output
    p.wait(60)
  File "/usr/lib64/python3.6/subprocess.py", line 1469, in wait
    raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['nvidia-settings', '-a', '[gpu:0]/GPUFanControlState=0', '-c', ':1']' timed out after 60 seconds
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/bin/coolgpus", line 266, in
Jan 25 22:17:24 nvidia-2 coolgpus: File "/usr/bin/coolgpus", line 266, in
For God's sake... ridiculous. Open /usr/bin/coolgpus and add at the top:
import subprocess
and at the top of the function kill_xservers():
subprocess.run(['killall', '-9', 'Xorg'])
return
Ditch the rest of the function. Solved.
For anyone still experiencing this issue, I have slapped together a bash script which at least allows setting a fixed fan speed for all GPUs in the system, regardless of whether a monitor is attached. It supports amdgpu too: https://github.com/lavanoid/Linux_GPU_Fan_Control