kernel-tools
Measure the stability of kernels without hardware IO/Coherency
Stable versions 3.17.8 and 3.18.5 have been released with a fix that disables Marvell's hardware I/O coherency feature.
It seems that random problems appeared when we bumped to these kernel versions.
We will probably skip these stable versions until a patch re-enables the feature.
- https://www.kernel.org/pub/linux/kernel/v3.x/ChangeLog-3.18.5
- https://www.kernel.org/pub/linux/kernel/v3.x/ChangeLog-3.17.8
commit 1f20756ce695ee56c2899e95757497d9c1cc8bbb
Author: Thomas Petazzoni <[email protected]>
Date: Fri Jan 16 17:11:27 2015 +0100
ARM: mvebu: completely disable hardware I/O coherency
commit 8f1e8ee28660018a935c7576b9af8ffe1feab54c upstream.
The current hardware I/O coherency is known to cause problems with DMA
coherent buffers, as it still requires explicit I/O synchronization
barriers, which is not compatible with the semantics expected by the
Linux DMA coherent buffers API.
So, in order to have enough time to validate a new solution based on
automatic I/O synchronization barriers, this commit disables hardware
I/O coherency entirely. Future patches will re-enable it.
Signed-off-by: Thomas Petazzoni <[email protected]>
Signed-off-by: Andrew Lunn <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
cc @tpetazzoni
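As a quick way to tell which stable releases carry the fix, the two version numbers above can be turned into a small check. This is only a sketch: the name `has_coherency_disable_fix` and the `FIRST_FIXED` table are hypothetical, the check only knows about the 3.17.x and 3.18.x series mentioned in this issue, and it cannot detect custom builds where the commit has been reverted. For any other series it returns `None` rather than guessing.

```python
import re

# First stable release of each series that ships the fix, as stated in this
# issue (3.17.8 and 3.18.5). Other series are deliberately left out.
FIRST_FIXED = {(3, 17): 8, (3, 18): 5}


def has_coherency_disable_fix(version):
    """Return True/False for a known series, None when the series is unknown."""
    match = re.match(r"^(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        return None
    major, minor = int(match.group(1)), int(match.group(2))
    patch = int(match.group(3) or 0)
    first_fixed = FIRST_FIXED.get((major, minor))
    if first_fixed is None:
        return None  # series not covered by this issue
    return patch >= first_fixed
```

For example, `has_coherency_disable_fix("3.18.4")` is `False` and `has_coherency_disable_fix("3.18.5")` is `True`.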
Hum, I am a bit surprised, because the current HW I/O coherency implementation is known to be broken, so disabling it should not make things worse, but actually better.
I'm not 100% sure that this patch is the issue; I'll confirm this today.
We still have the 'rare' issue with 3.17. I've spent some time playing with 3.18, and we have a lot of random crashes (I don't have a reproducible scenario, it's too random, but I can make them crash in a few minutes). This patch is on the list of candidates, but maybe it's something else.
Maybe this patch actually fixes the 'rare' issue, and something else causes the random crashes; I hadn't caught this regression because I thought it was the 'rare' issue.
I'll mail you soon with numbers concerning power usage.
We will test today:
- 3.18.4
- 3.18.5
- 3.18.5 with a git revert to re-enable the feature (https://github.com/online-labs/kernel-config/tree/testing/3.18.5-iocoherency)
@moul Thanks a lot for all this testing!
OK, so I really think this patch introduces stability issues. Here is what I did:
- reboot each kernel 10 times, check that everything boots, then run md5sum on several files to trigger I/O and network traffic (nbd)
I did it on three kernels:
- 3.18.5 with this patch
- 3.18.5 without this patch (that's the only difference between the two kernels)
- 3.17.0
With the first one, several services randomly crash at startup, and the md5sum often triggers a kernel panic (I had 1 correct boot out of 10). With the second and third ones, I don't have these issues.
(The 'rare' issue, which is difficult to trigger, is still present on the second and third kernels.)
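The md5sum step of the test above could be sketched roughly as follows. This is not the script that was actually used: the function names `md5_of` and `find_unstable_files` are hypothetical, and comparing repeated digests of the same file is an added assumption (in the runs described here the failure showed up as a kernel panic, not as a digest mismatch). Note also that repeated reads may be served from the page cache, so this does not guarantee fresh I/O on every round.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to generate I/O."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_unstable_files(paths, rounds=3):
    """Hash each file `rounds` times; return the paths whose digests disagree."""
    unstable = []
    for path in paths:
        digests = {md5_of(path) for _ in range(rounds)}
        if len(digests) > 1:
            unstable.append(path)
    return unstable
```

An empty list from `find_unstable_files` only means the reads were consistent; a kernel that panics during the run will of course never return at all.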
@aimxhaisse Thanks a lot for the testing. This is not good news. Can you confirm the tests were run on Armada XP?
@tpetazzoni Yes, that was on Armada XP.
There's still some hope that maybe this patch indeed solved the 'rare' issue, but I can't verify this.
@aimxhaisse The L2 support on Armada XP does not have any ->clean, ->invalidate or ->sync operations. Normally, the L2 on Armada XP is an "inner" cache, and the L1 is supposed to broadcast its maintenance operations to the L2 automatically. So normally, when the kernel does L1 maintenance operations, this should take care of the L2 as well. However, we don't know whether this broadcasting mechanism is supposed to work without HW I/O coherency. This is something we could check with Marvell.
Another thing you could try is our patches that make I/O coherency work properly by using the automatic barriers. I don't remember whether you have already tested those patches.
To test the patch that disables I/O coherency, it would be interesting to try to use the Armada 370 cache operations on Armada XP, and see if it helps. I'll try to talk with Marvell about this.
@tpetazzoni Thanks for the details! I've started reading some docs/code about it, but I'm still wandering in the wilderness for now. I had tried some patches you gave me regarding I/O coherency (on infradead's repository), but still had stability issues. However, I don't know which 'kind' of stability issue it was, so I can re-run some tests with those. I'll also try Willy's suggestion (backporting this patch to an older kernel).
Just saw this commit https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=6ab11bbf0eeed2bf7a2bfb3a7880a0bbed10cbd9 in 3.18.6; it does not seem to affect Armada XP.
We will give 3.18.6 a try anyway.
It seems that this issue is fixed on 4.2 kernels \o/ (https://github.com/scaleway/kernel-tools/blob/master/4.2-std/.config)