Measure the stability of kernels without hardware IO/Coherency

Open moul opened this issue 10 years ago • 11 comments

Stable versions 3.17.8 and 3.18.5 have been released with a fix that disables Marvell's hardware I/O coherency feature.

It seems that random problems appeared when bumping to these kernel versions.

We will probably skip stable versions until a patch re-enables it.


  • https://www.kernel.org/pub/linux/kernel/v3.x/ChangeLog-3.18.5
  • https://www.kernel.org/pub/linux/kernel/v3.x/ChangeLog-3.17.8
commit 1f20756ce695ee56c2899e95757497d9c1cc8bbb
Author: Thomas Petazzoni <[email protected]>
Date:   Fri Jan 16 17:11:27 2015 +0100

    ARM: mvebu: completely disable hardware I/O coherency

    commit 8f1e8ee28660018a935c7576b9af8ffe1feab54c upstream.

    The current hardware I/O coherency is known to cause problems with DMA
    coherent buffers, as it still requires explicit I/O synchronization
    barriers, which is not compatible with the semantics expected by the
    Linux DMA coherent buffers API.

    So, in order to have enough time to validate a new solution based on
    automatic I/O synchronization barriers, this commit disables hardware
    I/O coherency entirely. Future patches will re-enable it.

    Signed-off-by: Thomas Petazzoni <[email protected]>
    Signed-off-by: Andrew Lunn <[email protected]>
    Signed-off-by: Greg Kroah-Hartman <[email protected]>

cc @tpetazzoni

moul avatar Feb 04 '15 17:02 moul

Hum, I am a bit surprised, because the current HW I/O coherency implementation is known to be broken, so disabling it should not make things worse, but actually better.

tpetazzoni avatar Feb 05 '15 08:02 tpetazzoni

I'm not 100% sure that this patch is the issue; I'll confirm this today.

We still have the 'rare' issue with 3.17. I've spent some time playing with 3.18, and we have a lot of random crashes (I don't have a reproducible scenario, it's too random, but I can make them crash in a few minutes). This patch is on the list of candidates, but maybe it's something else.

Maybe this patch actually fixes the 'rare' issue and something else causes the random crashes; I didn't catch this regression because I thought it was the 'rare' issue.

I'll mail you soon with numbers concerning power usage.

aimxhaisse avatar Feb 05 '15 08:02 aimxhaisse

We will test today:

  • 3.18.4
  • 3.18.5
  • 3.18.5 with a git revert to re-enable the feature (https://github.com/online-labs/kernel-config/tree/testing/3.18.5-iocoherency)

moul avatar Feb 05 '15 09:02 moul

@moul Thanks a lot for all this testing!

tpetazzoni avatar Feb 05 '15 12:02 tpetazzoni

Ok, so I really think this patch introduces stability issues. What I did:

  • reboot each kernel 10 times, check that everything boots, then run md5sum on several files to trigger I/O and network traffic (nbd)

I did it on three kernels:

  • 3.18.5 with this patch
  • 3.18.5 without this patch (that's the only difference between the two kernels)
  • 3.17.0

With the first one, several services randomly crash at startup, and the md5sum often triggers a kernel panic (I had 1 correct boot out of 10). With the second and third ones, I don't have these issues.

(there is still the 'rare' issue difficult to trigger on the second/third kernel)
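
The md5sum part of this procedure can be sketched as follows. This is a minimal illustration, not the script used for the tests above: the helper names `md5_of` and `find_unstable_files` are hypothetical, and it assumes that a file whose digest changes between passes is a sign of corrupted reads.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_unstable_files(paths, passes: int = 3):
    """Hash every file `passes` times and return those whose digest changed."""
    unstable = []
    for path in map(Path, paths):
        digests = {md5_of(path) for _ in range(passes)}
        if len(digests) > 1:  # more than one distinct digest: reads disagreed
            unstable.append(path)
    return unstable
```

Repeated reads of a small file are usually served from the page cache, so to actually exercise the nbd path the files would need to be large relative to RAM, or the caches dropped between passes.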

aimxhaisse avatar Feb 05 '15 17:02 aimxhaisse

@aimxhaisse Thanks a lot for the testing. This is not good news. Can you confirm the tests were run on Armada XP?

tpetazzoni avatar Feb 05 '15 20:02 tpetazzoni

@tpetazzoni Yes, that was on Armada XP.

There's still some hope that maybe this patch indeed solved the 'rare' issue, but I can't verify this.

aimxhaisse avatar Feb 05 '15 21:02 aimxhaisse

@aimxhaisse The L2 support on Armada XP does not have any ->clean, ->invalidate or ->sync operations. Normally, the L2 on Armada XP is an "inner" cache, and the L1 is supposed to broadcast its maintenance operations to the L2 automatically. So when the kernel does L1 maintenance operations, that should take care of the L2 as well. However, we don't know whether this broadcasting mechanism is supposed to work without HW I/O coherency. This is something we could check with Marvell.

Another thing you could try is our patches to make I/O coherency work properly, using the automatic barriers. I don't remember if you tested those patches already or not.

To test the patch that disables I/O coherency, it would be interesting to try to use the Armada 370 cache operations on Armada XP, and see if it helps. I'll try to talk with Marvell about this.

tpetazzoni avatar Feb 06 '15 10:02 tpetazzoni

@tpetazzoni Thanks for the details! I've started reading some docs/code about it, but I'm still wandering in the wilderness for now. I had tried some patches you gave me regarding I/O coherency (on infradead's repository), but still had stability issues. However, I don't know which 'kind' of stability issue it was; I can re-run some tests with those. I'll also try Willy's suggestion (backporting this patch to an older kernel).

aimxhaisse avatar Feb 06 '15 14:02 aimxhaisse

Just saw this commit https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=6ab11bbf0eeed2bf7a2bfb3a7880a0bbed10cbd9 in 3.18.6; it does not seem to affect Armada XP.

We will give 3.18.6 a try anyway.

moul avatar Feb 09 '15 16:02 moul

It seems that this issue is fixed on 4.2 kernels \o/ ( https://github.com/scaleway/kernel-tools/blob/master/4.2-std/.config )

aimxhaisse avatar Sep 15 '15 08:09 aimxhaisse