Not iterating through all GPUs in the system (3 of them: GTX 1070, 750 Ti, 950) — stuck at the 1st one in the list and hangs forever.
/usr/bin/coolgpus --speed 60 60
No existing X servers, we're good to go
Starting xserver: Xorg :0 -once -config /tmp/cool-gpu-00000000:01:00.0rr5gi2u3/xorg.conf
Starting xserver: Xorg :1 -once -config /tmp/cool-gpu-00000000:05:00.0f1ewmqu3/xorg.conf
Starting xserver: Xorg :2 -once -config /tmp/cool-gpu-00000000:09:00.0v06tfhl_/xorg.conf
X.Org X Server 1.19.5
Release Date: 2017-10-12
X Protocol Version 11, Revision 0
Build Operating System: 3.10.0-693.17.1.el7.x86_64
Current Operating System: Linux nvidia-2 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64
Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.el7.x86_64 root=UUID=d1853314-ef45-4eaf-a4d5-ca9205c2471f ro i915.modeset=1 i915.preliminary_hw_support=1 rhgb quiet
Build Date: 29 October 2018 03:33:19PM
Build ID: xorg-x11-server 1.19.5-5.1.el7_5.0.1
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Tue Jul 21 00:55:38 2020
(++) Using config file: "/tmp/cool-gpu-00000000:01:00.0rr5gi2u3/xorg.conf"
(==) Using config directory: "/etc/X11/xorg.conf.d"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
X.Org X Server 1.19.5
Release Date: 2017-10-12
X Protocol Version 11, Revision 0
Build Operating System: 3.10.0-693.17.1.el7.x86_64
Current Operating System: Linux nvidia-2 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64
Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.el7.x86_64 root=UUID=d1853314-ef45-4eaf-a4d5-ca9205c2471f ro i915.modeset=1 i915.preliminary_hw_support=1 rhgb quiet
Build Date: 29 October 2018 03:33:19PM
Build ID: xorg-x11-server 1.19.5-5.1.el7_5.0.1
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.1.log", Time: Tue Jul 21 00:55:38 2020
(++) Using config file: "/tmp/cool-gpu-00000000:05:00.0f1ewmqu3/xorg.conf"
(==) Using config directory: "/etc/X11/xorg.conf.d"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
X.Org X Server 1.19.5
Release Date: 2017-10-12
X Protocol Version 11, Revision 0
Build Operating System: 3.10.0-693.17.1.el7.x86_64
Current Operating System: Linux nvidia-2 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64
Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.el7.x86_64 root=UUID=d1853314-ef45-4eaf-a4d5-ca9205c2471f ro i915.modeset=1 i915.preliminary_hw_support=1 rhgb quiet
Build Date: 29 October 2018 03:33:19PM
Build ID: xorg-x11-server 1.19.5-5.1.el7_5.0.1
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.2.log", Time: Tue Jul 21 00:55:38 2020
(++) Using config file: "/tmp/cool-gpu-00000000:09:00.0v06tfhl_/xorg.conf"
(==) Using config directory: "/etc/X11/xorg.conf.d"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
GPU :0, 58C -> [60%-60%]. Setting speed to 60%
(hangs here)
^C
Released fan speed control for GPU at :0
(hangs again)
nvidia-smi
Tue Jul 21 01:12:22 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57 Driver Version: 450.57 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 Off | 00000000:01:00.0 Off | N/A |
| 60% 53C P2 69W / 151W | 2997MiB / 8119MiB | 23% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 750 Ti Off | 00000000:05:00.0 Off | N/A |
| 48% 56C P0 30W / 38W | 1229MiB / 2002MiB | 83% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 GeForce GTX 950 Off | 00000000:09:00.0 Off | N/A |
| 32% 71C P0 45W / 75W | 974MiB / 2002MiB | 51% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
There's some troubleshooting advice on the main page; mind stepping through it?
If you're uncomfortable with pdb, looking at where things are hanging I think a good alternative would be to add print(command) above this line and see what command is making things hang. Then try and run that command manually.
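For concreteness, here's a hedged sketch of what that print-debugging looks like; the log_output helper below is paraphrased from coolgpus and the exact body may differ in your copy:

```python
import subprocess

def log_output(command, timeout=60):
    # Print the command list before launching it: if the script hangs,
    # the last line printed names the command that never returned.
    print(command)
    p = subprocess.Popen(command, stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT)
    p.wait(timeout)
    return p.stdout.read().decode()
```

Once you know which command hangs, run that exact command in a terminal by hand.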
There's some troubleshooting advice on the main page; mind stepping through it?
If you're uncomfortable with pdb, looking at where things are hanging I think a good alternative would be to add print(command) above this line and see what command is making things hang. Then try and run that command manually.
I didn't understand what pdb was supposed to do; I saw no changes.
I added the print:
(==) Log file: "/var/log/Xorg.1.log", Time: Tue Jul 21 01:40:49 2020
(++) Using config file: "/tmp/cool-gpu-00000000:05:00.0prjzqd_m/xorg.conf"
(==) Using config directory: "/etc/X11/xorg.conf.d"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
['nvidia-settings', '-a', '[gpu:0]/GPUFanControlState=1', '-c', ':0']
['nvidia-settings', '-a', '[fan:0]/GPUTargetFanSpeed=60', '-c', ':0']
GPU :0, 60C -> [60%-60%]. Setting speed to 60%
['nvidia-smi', '--format=csv,noheader', '--query-gpu=temperature.gpu', '-i', '00000000:05:00.0']
['nvidia-settings', '-a', '[gpu:0]/GPUFanControlState=1', '-c', ':1']
['nvidia-settings', '-a', '[fan:0]/GPUTargetFanSpeed=60', '-c', ':1']
this is it
Try running
nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :1
in a terminal.
Adding pdb.set_trace() somewhere should drop you into an interactive debugger prompt when you run the program.
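A minimal sketch of how that looks (build_fan_command and the debug flag are illustrative names, not coolgpus's actual code):

```python
import pdb

def build_fan_command(display, speed):
    # Roughly the nvidia-settings invocation the script prints.
    return ['nvidia-settings', '-a',
            '[fan:0]/GPUTargetFanSpeed=%d' % speed, '-c', display]

def set_speed(display, speed, debug=False):
    command = build_fan_command(display, speed)
    if debug:
        # Drops you at a (Pdb) prompt: `p command` to inspect, `c` to continue.
        pdb.set_trace()
    print(command)  # stand-in for actually running the command
```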
nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :1
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system
The first 2 GPUs have no physical displays connected; the 3rd one does. However, even without a physical display the problem is the same.
I'd like to mention that in some random attempts I was able to set all the fans. I don't even remember what I was doing: I rebooted a couple of times, reinstalled the driver, replugged the physical display, and the script passed a few times. But after closing it and running it again, it hangs.
I don't get this...
/usr/bin/coolgpus --speed 60 60
> /usr/bin/coolgpus(11)<module>()
-> parser = argparse.ArgumentParser(description=r'''
(Pdb)
(Pdb)
(Pdb)
(Pdb) help
(Pdb) run
Traceback (most recent call last):
File "/usr/bin/coolgpus", line 11, in <module>
parser = argparse.ArgumentParser(description=r'''
File "/usr/bin/coolgpus", line 11, in <module>
parser = argparse.ArgumentParser(description=r'''
File "/usr/lib64/python3.6/bdb.py", line 51, in trace_dispatch
return self.dispatch_line(frame)
File "/usr/lib64/python3.6/bdb.py", line 69, in dispatch_line
self.user_line(frame)
File "/usr/lib64/python3.6/pdb.py", line 261, in user_line
self.interaction(frame, None)
File "/usr/lib64/python3.6/pdb.py", line 352, in interaction
self._cmdloop()
File "/usr/lib64/python3.6/pdb.py", line 321, in _cmdloop
self.cmdloop()
File "/usr/lib64/python3.6/cmd.py", line 138, in cmdloop
stop = self.onecmd(line)
File "/usr/lib64/python3.6/pdb.py", line 418, in onecmd
return cmd.Cmd.onecmd(self, line)
File "/usr/lib64/python3.6/cmd.py", line 217, in onecmd
return func(arg)
File "/usr/lib64/python3.6/pdb.py", line 1028, in do_run
raise Restart
pdb.Restart
coolgpus will not work on any system with a display or any system that's expecting a display. You'll need to remove the display, restart, SSH in, and toy around until
nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :1
works for -c :0, -c :1, -c :2. Googling the error message has some possible directions. But this is, I'm afraid, much more about debugging your system setup than it is about debugging the script.
Take a look at the pdb docs. No longer useful for this problem, but overall it's one of the most useful tools in Python programming. Especially the pdb.pm() bit.
I did not have a display connected initially, it makes no difference. I've connected it at the last attempt to see what changes. Well, nothing.
nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :1
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system
[root@nvidia-2 ~]# nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :0
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system
[root@nvidia-2 ~]# nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :2
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system
[root@nvidia-2 ~]# nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :3
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system
[root@nvidia-2 ~]# nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :4
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system
The system is a plain CentOS 7 with the NVIDIA driver for headless transcoding; there is nothing custom on it to debug.
"Unable to find display on any available system" — why would it find a display if I removed the physical one, as you suggested? That doesn't make any sense.
Not to discourage you too much but: it feels like you're hoping that I have more knowledge about this than I do. Your system absolutely has something to debug, as you can tell by the way a thing you want to do isn't working as you'd expect it to.
If you want to push forward with this, a general loop should be:
- Take anything you know about the problem you're seeing (i.e., ERROR: Unable to find display on any available system) and Google until you find people with similar problems.
- Try out their fixes.
- If their fixes don't work, think about how your case differs from their case, or what you can do to more accurately isolate the problem you're seeing.
- Go back to doing more Googling.
It's hard! This might take hours or days! You might have to learn huge amounts about subjects that are totally irrelevant, just to check one possible fix! It probably won't be worth it! But, frankly: the only other choice is to give up and decide you don't care that much about coolgpu's functionality.
"Unable to find display on any available system" is literally what it says: no displays attached, either physical or virtual. Doesn't your script create virtual displays to set fan speeds? Apparently it does, because I'm able to run "nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :2" while your script is open in another SSH window, but no output comes. So I don't understand what exactly about the "Unable to find display on any available system" message has to be fixed, or why. It is expected and not related to the issue described.
OK - in that case, use pdb, use print statements - figure out where it is the script is actually hanging, then make sure you can replicate it yourself, then Google around to figure out what's causing that hanging.
You may need to replicate the xserver setup the script is doing to replicate the hanging. You might want to tear the xserver setup bit of coolgpus out into your own script, then run that and leave it running in the background while you experiment. There're lots of ways forward, just requires a bit of ingenuity!
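A sketch of that isolation step, assuming you reuse the xorg.conf files coolgpus writes under /tmp (the helper names and config path here are illustrative, not part of the script):

```python
import subprocess

def xserver_cmd(display, config):
    # Same shape as the "Starting xserver: Xorg :1 -once -config ..." line
    # that coolgpus logs.
    return ['Xorg', display, '-once', '-config', config]

def fan_cmd(display, speed=60):
    return ['nvidia-settings', '-a',
            '[fan:0]/GPUTargetFanSpeed=%d' % speed, '-c', display]

def replicate(display, config, speed=60):
    # Leave the X server running in the background, then poke it by hand;
    # if this hangs the same way, you've reproduced the problem outside
    # the script and can experiment with the command freely.
    server = subprocess.Popen(xserver_cmd(display, config))
    try:
        subprocess.run(fan_cmd(display, speed), timeout=60)
    finally:
        server.terminate()
```

Call replicate(':1', ...) with the config path your own coolgpus run printed.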
I already added the print and posted it earlier; it hangs on:
['nvidia-settings', '-a', '[gpu:0]/GPUFanControlState=1', '-c', ':0']
['nvidia-settings', '-a', '[fan:0]/GPUTargetFanSpeed=60', '-c', ':0']
GPU :0, 60C -> [60%-60%]. Setting speed to 60%
['nvidia-smi', '--format=csv,noheader', '--query-gpu=temperature.gpu', '-i', '00000000:05:00.0']
['nvidia-settings', '-a', '[gpu:0]/GPUFanControlState=1', '-c', ':1']
^C
Google is useless here; I've spent 5 hours today before writing here.
I always have to kill Xorg like this, because it never exits: killall Xorg -9
Reinstalling xorg server does not help.
Ok, let's make it easier. I need to adjust only last GPU in list, how to select only one GPU with this script and skip others?
Right! That's the spirit.
The answer is: there's no built-in method. Try cloning this repo and editing the script yourself; add a conditional somewhere to only look at specific GPUs.
More generally, you know where the script hangs but you haven't isolated the aberrant behaviour. You want to be able to enter a series of commands into the terminal and get the same hang. Then you can experiment freely with that series of commands, try different versions, add --verbose flags, etc etc etc.
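One way that conditional might look, assuming you patch the gpu_buses helper the script uses to enumerate devices (WANTED and filter_buses are my names, not part of coolgpus):

```python
import subprocess

WANTED = {'00000000:09:00.0'}  # bus IDs you want managed; the rest are skipped

def filter_buses(buses, wanted=WANTED):
    # Keep only the GPUs whose PCI bus ID is in the wanted set.
    return [b for b in buses if b.strip() in wanted]

def gpu_buses():
    out = subprocess.check_output(
        ['nvidia-smi', '--format=csv,noheader', '--query-gpu=pci.bus_id'])
    return filter_buses(out.decode().splitlines())
```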
I think I get the same error when running nvidia-settings from terminal
$ nvidia-settings -a [fan:0]/GPUTargetFanSpeed=60 -c :0
Unable to init server: Could not connect: Connection refused
ERROR: Unable to find display on any available system
ERROR: Unable to find display on any available system
At the same time, I am able to run coolgpus from a conda environment. @Neolo can you try to run the script from this environment
name: coolgpus
channels:
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- ca-certificates=2020.1.1=0
- certifi=2019.11.28=py38_0
- ld_impl_linux-64=2.33.1=h53a641e_7
- libedit=3.1.20181209=hc058e9b_0
- libffi=3.2.1=hd88cf55_4
- libgcc-ng=9.1.0=hdf63c60_0
- libstdcxx-ng=9.1.0=hdf63c60_0
- ncurses=6.2=he6710b0_0
- openssl=1.1.1d=h7b6447c_4
- pip=20.0.2=py38_1
- python=3.8.1=h0371630_1
- readline=7.0=h7b6447c_5
- setuptools=45.2.0=py38_0
- sqlite=3.31.1=h7b6447c_0
- tk=8.6.8=hbc83047_0
- wheel=0.34.2=py38_0
- xz=5.2.4=h14c3975_4
- zlib=1.2.11=h7b6447c_3
- pip:
- coolgpus==0.17
Yep, I think @Neolo is right about the ERROR being a symptom of the missing xserver env. Still expect that command to be the source of the hang since it was the last command printed, just gonna need more work to get a manual reproduction.
I'll be surprised if the env is causing the hang, but it's a good idea since it's an easy thing to check.
So weird. I just removed contents of /etc/X11/ and ran
nvidia-xconfig --allow-empty-initial-configuration --enable-all-gpus --cool-bits=28 --separate-x-screens --enable-all-gpus --use-display-device=none
Using X configuration file: "/etc/X11/xorg.conf".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen0".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen0 (1)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen0 (2)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen0 (3)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen1".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen1 (1)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen1 (2)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen1 (3)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen2".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen2 (1)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen2 (2)".
Option "AllowEmptyInitialConfiguration" "True" added to Screen "Screen2 (3)".
Backed up file '/etc/X11/xorg.conf' as '/etc/X11/xorg.conf.backup'
New X configuration file written to '/etc/X11/xorg.conf'
and right after, for one time only, I was able to pass fan speeds to the 1st AND 2nd GPUs only; the 3rd hanged. Closed the script, ran it again — same problem, only the 1st GPU now, again...
At the same time, I am able to run coolgpus from a conda environment. @Neolo can you try to run the script from this environment
I'm not into Python; tell me what to run, I didn't get it.
- https://conda.io/projects/conda/en/latest/user-guide/install/linux.html (please follow the instructions for Miniconda)
- Verify installation (conda --help)
- Save the environment I shared with you into env.yml
- conda env create -f env.yml -- it should install the environment to your machine
- conda activate env -- activating the virtual env
- Run some examples from the README.md to see if it hangs in the same way.
Will try that environment tomorrow.
The answer is: there's no built in method. Try cloning this repo and editing the script yourself; add a conditional somewhere to only look at specific GPUs.
For now, I just made a dirty trick to select the last GPU in the list, which is "burning" right now at 72 C.
def gpu_buses():
    # return log_output(['nvidia-smi', '--format=csv,noheader', '--query-gpu=pci.bus_id']).splitlines()
    return '00000000:09:00.0'.splitlines()
and it sets the speed fine, no hangs.
I'm not into python, tell me what to run, I didn't get it.
- https://conda.io/projects/conda/en/latest/user-guide/install/linux.html (please follow the instructions for Miniconda)
- Verify installation (conda --help)
- Save the environment I shared with you into env.yml
- conda env create -f env.yml -- it should install the environment to your machine
- conda activate env -- activating the virtual env
- Run some examples from the README.md to see if it hangs in the same way.
Installed Miniconda, activated env, running "$(which coolgpus) --temp 60 60" just doesn't do anything, not even setting the first GPU at all.
[root@nvidia-2 ~]# conda env create -f env.yml
[root@nvidia-2 ~]# conda activate coolgpus
(coolgpus) [root@nvidia-2 ~]# conda -V
conda 4.8.3
(coolgpus) [root@nvidia-2 ~]# $(which coolgpus) --temp 60 60
No existing X servers, we're good to go
Starting xserver: Xorg :0 -once -config /tmp/cool-gpu-00000000:01:00.0qa_grbj8/xorg.conf
Starting xserver: Xorg :1 -once -config /tmp/cool-gpu-00000000:05:00.089yczaej/xorg.conf
Starting xserver: Xorg :2 -once -config /tmp/cool-gpu-00000000:09:00.0khgo7uqs/xorg.conf
X.Org X Server 1.19.3
Release Date: 2017-03-15
X Protocol Version 11, Revision 0
Build Operating System: 3.10.0-514.16.1.el7.x86_64
Current Operating System: Linux nvidia-2 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64
Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.el7.x86_64 root=UUID=d1853314-ef45-4eaf-a4d5-ca9205c2471f ro i915.modeset=1 i915.preliminary_hw_support=1 rhgb quiet
Build Date: 05 August 2017 06:19:43AM
Build ID: xorg-x11-server 1.19.3-11.el7
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Wed Jul 22 19:06:50 2020
(++) Using config file: "/tmp/cool-gpu-00000000:01:00.0qa_grbj8/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
X.Org X Server 1.19.3
Release Date: 2017-03-15
X Protocol Version 11, Revision 0
Build Operating System: 3.10.0-514.16.1.el7.x86_64
Current Operating System: Linux nvidia-2 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64
Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.el7.x86_64 root=UUID=d1853314-ef45-4eaf-a4d5-ca9205c2471f ro i915.modeset=1 i915.preliminary_hw_support=1 rhgb quiet
Build Date: 05 August 2017 06:19:43AM
Build ID: xorg-x11-server 1.19.3-11.el7
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.2.log", Time: Wed Jul 22 19:06:50 2020
(++) Using config file: "/tmp/cool-gpu-00000000:09:00.0khgo7uqs/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
X.Org X Server 1.19.3
Release Date: 2017-03-15
X Protocol Version 11, Revision 0
Build Operating System: 3.10.0-514.16.1.el7.x86_64
Current Operating System: Linux nvidia-2 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64
Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-693.el7.x86_64 root=UUID=d1853314-ef45-4eaf-a4d5-ca9205c2471f ro i915.modeset=1 i915.preliminary_hw_support=1 rhgb quiet
Build Date: 05 August 2017 06:19:43AM
Build ID: xorg-x11-server 1.19.3-11.el7
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.1.log", Time: Wed Jul 22 19:06:50 2020
(++) Using config file: "/tmp/cool-gpu-00000000:05:00.089yczaej/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
Released fan speed control for GPU at :0
Released fan speed control for GPU at :1
^C
I am having the same issue and nobody can solve it. What I do is create a custom xorg file that works for me, and then in the NVIDIA app on Ubuntu, under PowerMizer, I can change the settings of each fan for each GPU. This is not useful when working via SSH on a headless box, but unfortunately there is no easy-to-use nvidia-settings solution anywhere.
So basically this is what worked for me
How to change GPU fan speeds in Ubuntu
1- In the applications, open NVIDIA X Server Settings
2- Select the GPU currently used for display output (should be the GPU in first PCIe slot)
3- Take note of the Bus ID
4- Run the following commands
sudo nvidia-xconfig --enable-all-gpus
sudo nvidia-xconfig --cool-bits=28
sudo reboot
5- After the computer reboots, plug the monitor into the last GPU
6- Open NVIDIA X-Server Settings again
7- Select the GPU currently used for display output
8- Take note of the Bus ID
9- Run sudo nano /etc/X11/xorg.conf. The GPUs will be listed in "Device" sections with formatting similar to this:
Section "Device"
    Identifier "name"
    Driver "driver"
    ...entries...
EndSection
10- Identify the GPUs with the Bus IDs that were previously noted
11- Swap the Bus IDs of the two GPUs
12- Press Ctrl+X to close “xorg.conf”
13- Press Y to save the file
14- Press “Enter” without changing the file name
15- Reboot
Fan speeds can now be changed from NVIDIA X Server Settings by selecting the Thermal Settings for each GPU and checking the option to "Enable GPU Fan Settings". Set the fan speed with the slider and click "Apply" to save it.
Newer version — it works even worse.
(==) Log file: "/var/log/Xorg.2.log", Time: Mon Jan 25 22:02:05 2021
(++) Using config file: "/tmp/cool-gpu-00000000:09:00.0ngck_2l3/xorg.conf"
(==) Using config directory: "/etc/X11/xorg.conf.d"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
GPU :0, 66C -> [60%-65%]. Setting speed to 60%
GPU :1, 36C -> [30%-30%]. Setting speed to 30%
Command timed out: nvidia-settings -a [gpu:0]/GPUFanControlState=1 -c :1
Released fan speed control for GPU at :0
Command timed out: nvidia-settings -a [gpu:0]/GPUFanControlState=0 -c :1
Terminating xserver for display :0
Terminating xserver for display :1
Terminating xserver for display :2
Traceback (most recent call last):
  File "/usr/bin/coolgpus", line 89, in log_output
    p.wait(60)
  File "/usr/lib64/python3.6/subprocess.py", line 1469, in wait
    raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['nvidia-settings', '-a', '[gpu:0]/GPUFanControlState=1', '-c', ':1']' timed out after 60 seconds
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/bin/coolgpus", line 239, in manage_fans
    set_speed(display, s)
  File "/usr/bin/coolgpus", line 224, in set_speed
    assign(display, '[gpu:0]/GPUFanControlState=1')
  File "/usr/bin/coolgpus", line 221, in assign
    log_output(['nvidia-settings', '-a', command, '-c', display])
  File "/usr/bin/coolgpus", line 102, in log_output
    raise ValueError('Command crashed with return code ' + str(p.returncode) + ': ' + ' '.join(command))
ValueError: Command crashed with return code None: nvidia-settings -a [gpu:0]/GPUFanControlState=1 -c :1
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/bin/coolgpus", line 89, in log_output
    p.wait(60)
  File "/usr/lib64/python3.6/subprocess.py", line 1469, in wait
    raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['nvidia-settings', '-a', '[gpu:0]/GPUFanControlState=0', '-c', ':1']' timed out after 60 seconds
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/bin/coolgpus", line 266, in
Jan 25 22:17:24 nvidia-2 coolgpus: File "/usr/bin/coolgpus", line 266, in
For God's sake... ridiculous. Open /usr/bin/coolgpus and add at the top:
import subprocess
and at the top of the function kill_xservers():
subprocess.run(['killall', '-9', 'Xorg'])
return
Ditch the rest of the function. Solved.
For anyone still experiencing this issue, I have slapped together a bash script which at least allows setting a fixed fan speed for all GPUs in the system, regardless of whether a monitor is attached. It supports amdgpu too: https://github.com/lavanoid/Linux_GPU_Fan_Control