ddcutil icon indicating copy to clipboard operation
ddcutil copied to clipboard

Stop poking AMDGPU SMU I2C

Open K900 opened this issue 4 years ago • 15 comments

As of this writing, ddcutil will attempt to poke all i2c-N devices available on the system to detect monitors. However, on recent AMD cards (specifically Navi2x variants, i.e. RX 6000 series), even trying to read from the SMU I2C bus can get the GPU into an inconsistent state (see 1). This should eventually be resolved via a more correct I2C implementation on the kernel side (again, see 1), but it might be worth it to also add a filter on the ddcutil side to make sure the SMU bus is never touched - nothing good (for ddcutil) can ever come out of it anyway.

K900 avatar Mar 23 '21 11:03 K900

Is this something you've actually experienced with ddcutil, or just a general heads up?

In normal operation, ddcutil prefilters the i2c devices using /sys. If the device has a class attribute, it must either be x03 (display) or x0a (docking station). If no class attribute is found, it checks the device name, possibly augmented by the driver name. In particular, if the name starts with "smu", the device is ignored. (Probing smu devices was found to hang the system with Linux running on a Mac G5.)

If you are seeing the problem on ddcutil, we can probe further. Normally I'd just ask for the output of ddcutil environment --very-verbose, but the environment command is bit more promiscuous in choosing which I2C devices to examine, so there's a chance that it could trigger the problem.

rockowitz avatar Mar 23 '21 12:03 rockowitz

It is, unfortunately, a very real issue. Output attached. env.txt

The SMU bus reports class 0x03 (which is probably also wrong?), and is named "AMDGPU SMU", which is probably why it passes the "starts with smu" check.

K900 avatar Mar 23 '21 12:03 K900

I've added "AMDGPU SMU" to the list of names for devices that should be ignored, and pushed the change to branch 1.1.0-dev. Let me know if it works.

The problem is that the device class (e.g. x03) is tied to the device, not to individual /dev/i2c buses. (see for example line 2342 of the environment output.) This actually makes sense, since the entire device is a video adapter. So we're left with looking at the device name. Name prefixes, or driver/name combinations, are added as they are encountered, which is infrequent - "AMDGPU SMU" increases the list size from 7 to 8.

rockowitz avatar Mar 23 '21 14:03 rockowitz

Thanks! Bad news though, in order:

  1. I've just tried to compile it and I'm hitting a bunch of stringop-overflow and stringflow-truncation warnings. Build with CFLAGS="-Wno-string-overflow -Wno-stringop-truncation" helped for now, but it looks like those are actually worth fixing:
In function ‘assemble_sysfs_path2’,
    inlined from ‘rpt2_attr_binary’ at sysfs_util.c:299:4:
sysfs_util.c:244:7: error: ‘strncat’ specified bound 4096 equals destination size [-Werror=stringop-overflow=]
  244 |       strncat(buffer, "/", bufsz);
      |       ^~~~~~~~~~~~~~~~~~~~~~~~~~~
sysfs_util.c:238:4: error: ‘strncpy’ specified bound 4096 equals destination size [-Werror=stringop-truncation]
  238 |    strncpy(buffer, fn_segment, bufsz);
      |    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sysfs_util.c:246:7: error: ‘strncat’ specified bound 4096 equals destination size [-Werror=stringop-truncation]
  246 |       strncat(buffer, segment, bufsz);
      |       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sysfs_util.c:246:7: error: ‘strncat’ specified bound 4096 equals destination size [-Werror=stringop-overflow=]
  1. The ignorable_i2c_device_sysfs_name check is never hit, because we do have a class, and that class is considered "OK" here: https://github.com/rockowitz/ddcutil/blob/1.1.0-dev/src/util/sysfs_i2c_util.c#L191

K900 avatar Mar 23 '21 14:03 K900

The current 1.1.0-dev branch contains a fix to the check if a /dev/i2c device is ignorable. First the device name is checked. If the device name is not on the explicit ignorable list, then the class is checked.

It also (hopefully) addresses the stringop-overflow and stringop-truncation warnings. These were introduced in a point release of gcc 8. Modifying the code so that it compiles without warning no matter what release of gcc or clang is in use, and is still clean to read, turned out to be an ugly little problem. Even glib function g_strlcpy() gets flagged. See, for example, this long stackoverflow discussion. (I inserted "hopefully" because the full set of distribution releases and architecture variants have not yet been checked.)

rockowitz avatar Mar 26 '21 05:03 rockowitz

The latest version compiles cleanly and works perfectly, thanks! Any chance we can get this backported to the older stable releases as well?

K900 avatar Mar 26 '21 07:03 K900

What is the context in which backporting a fix to a prior release is important? The only situation in which I could see this mattering is if an application is using the shared library and can't rebuild to use the latest version. I'm not aware of any such application.

rockowitz avatar Mar 27 '21 11:03 rockowitz

Distribution policies, mostly. Debian/Ubuntu generally don't allow new feature releases into their stable branches, so 1.0.x to 1.0.x+1 is OK, but 1.0.x to 1.1.y won't be.

K900 avatar Mar 27 '21 11:03 K900

Hi @rockowitz,

This bug is a kernel bug, specifically in the way the amdgpu driver communicated with the AMD SMU. It has been fixed by the tip of:

https://gitlab.freedesktop.org/ltuikov/linux/-/commits/i2c-rework-luben

ddcutil can now freely scan AMDGPU SMU busses. Filtering out "AMDGPU SMU" should be reverted, especially when we merge the tip of my branch into mainline.

I'm using ddcutil-0.9.9 on the kernel above and it scans the AMDGPU SMU bus without a problem.

Regards, Luben

ltuikov avatar Jul 10 '21 15:07 ltuikov

ddcutil doesn't really gain anything by scanning the SMU bus anyway, and people may be running it on older kernels. I'd just keep it on the blocklist.

K900 avatar Jul 10 '21 15:07 K900

It should eventually be phased out.

It was unfortunate that this but was added to the kernel more than 2 years ago when some things were moved around.

ltuikov avatar Jul 10 '21 15:07 ltuikov

Maybe add a kernel version check?

K900 avatar Jul 10 '21 15:07 K900

In general, ddcutil avoids probing SMU devices if possible. Not only can this cause hard failures, but probing I2C buses is an expensive operation, to be avoided when we know a priori that a /dev/i2c device is not a monitor. There's no harm in leaving the check in. and avoids tying ddcutil to all but recent kernels.

rockowitz avatar Jul 11 '21 23:07 rockowitz

@ltuikov Luben, I assume you're an appropriate person to ask the following questions. If not, can you pass it on or redirect me?

I'm continuing to see problem reports with Navi GPU cards, e.g. #223, #224. What's the minimum kernel release that contains the SMU changes? I'll tell people to ensure they're at that kernel level and, if so, to post a bug report on, I assume, https://gitlab.freedesktop.org/drm/amd/-/issues. What would you like to see in the bug report?

Regards, Sanford

rockowitz avatar Sep 27 '21 14:09 rockowitz

@rockowitz Hi, The issue is resolved by at least these two patches, as shown in Linus's master branch:

544dcd74b7093a drm/amd/pm: Fix a bug in semaphore double-lock 5810323ba69289 drm/amd/pm: Fix a bug communicating with the SMU (v5)

If you have at least these two commits, then the SMU will not get in an "inconsistent state", and you don't need filtering. It just works. You can use a previous version of ddcutil, presumably without filtering.

Regards, Luben

ltuikov avatar Sep 29 '21 04:09 ltuikov