cacule-cpu-scheduler
Experiencing some random hangs under heavy workload

Open ltsdw opened this issue 2 years ago • 67 comments

I've been experiencing hangs (everything freezes for about 5 seconds) when playing games on Wine that use a lot of CPU, and sometimes when watching videos.

To be sure it was the cacule patch and nothing else, I tested with the mainline Arch kernel (no hangs). Since I have some other patches applied to my kernel, I also tried compiling it without the cacule patch (also no hangs). Then I applied the cacule patch again and the hangs came back.

I'm not quite sure, but I think the commit that introduced it is 06cb3974.

I haven't tried reverting the commit to test; I only tested with these:

cacule-patch-with-hangs.txt - patch where the hangs happen

cacule-without-hangs.txt - patch without the hangs

But if needed I can try bisecting later to see exactly which commit causes it.

ltsdw avatar Aug 17 '21 11:08 ltsdw

Also all the tunable configs are the default.

ltsdw avatar Aug 17 '21 11:08 ltsdw

Hi @ltsdw

Based on https://github.com/hamadmarri/cacule-cpu-scheduler/discussions/43, have you tried reducing kernel.sched_cache_factor to a lower value, e.g. 0? Also, from my experience, you could try setting kernel.sched_cacule_yield to 0, since it may cause freezes due to some I/O issues; see https://github.com/hamadmarri/cacule-cpu-scheduler/issues/35.

raykzhao avatar Aug 17 '21 11:08 raykzhao

Hi there @raykzhao

Thank you for your suggestion, I'll try it.

ltsdw avatar Aug 17 '21 12:08 ltsdw

Sadly it didn't work. I tried kernel.sched_cache_factor=0 and kernel.sched_cacule_yield=0, but the hangs still happen.

ltsdw avatar Aug 17 '21 13:08 ltsdw
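
For anyone repeating this test: sysctl names such as kernel.sched_cache_factor map to files under /proc/sys, so the values can be changed at runtime without rebooting. Below is a minimal sketch, assuming a CacULE-patched kernel that exposes these files and a process running as root. The helper names sysctl_path and apply_tunables are illustrative and not part of the CacULE patch.

```python
from pathlib import Path

# Tunables mentioned in this thread; they exist only on a CacULE-patched kernel.
TUNABLES = {
    "kernel.sched_cache_factor": 0,
    "kernel.sched_cacule_yield": 0,
}


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root).joinpath(*name.split("."))


def apply_tunables(tunables, root: str = "/proc/sys"):
    """Write each value; return the names that could not be written."""
    failed = []
    for name, value in tunables.items():
        try:
            sysctl_path(name, root).write_text(f"{value}\n")
        except OSError:
            # Typical causes: not running as root, or the kernel lacks the tunable.
            failed.append(name)
    return failed


# Usage (as root): apply_tunables(TUNABLES) returns [] when every write succeeded.
```

Values written this way do not survive a reboot, which makes it convenient for quick A/B tests like the one above.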

kernel.sched_cache_factor=0

Could you please also set kernel.sched_starve_factor=0

Is RDB enabled?

hamadmarri avatar Aug 17 '21 13:08 hamadmarri

Could you please also set kernel.sched_starve_factor=0

The hangs still happen with kernel.sched_starve_factor=0.

Is RDB enabled?

I believe so; I think it's enabled by default with the patch.

ltsdw avatar Aug 17 '21 13:08 ltsdw

Could you please try without RDB?

hamadmarri avatar Aug 17 '21 13:08 hamadmarri

I don't think there is a way to disable RDB at runtime, so I need to recompile, right?

ltsdw avatar Aug 17 '21 13:08 ltsdw

Yes, you need to recompile. I think the version that was working for you was without RDB. Could you please attach the .config too?

Also provide all technical information and versions like kernel, wine, which game, and what settings.

Thanks

hamadmarri avatar Aug 17 '21 13:08 hamadmarri

Sure, here is the config from my last build on 5.13.8: config.txt.

CPU: i5 5200U
GPU: Intel(R) HD Graphics 5500 (using iris)
RAM: 8 GB
Mesa: 21.3.0 (commit c0fc745b78b)
Wine: 6.13 (with some patches from tkg)
Games I tested with: NovaRO, GTA San Andreas, Path of Exile (for this one I blame my GPU more than anything else). It also happens out of nowhere when watching videos or when I'm compiling something.

When you say settings, which ones do you mean? The cacule ones? If so, they are all at their defaults.

Now let me recompile; it will take some time.

ltsdw avatar Aug 17 '21 13:08 ltsdw

I have such lags in RDR2 (only), and setting kernel.sched_interactivity_factor=50 seems to help. It doesn't happen without RDB, but without RDB background load has stronger negative effects. I will test kernel.sched_starve_factor=0, too.

JohnyPeaN avatar Aug 17 '21 14:08 JohnyPeaN

Yes, I can confirm, disabling the RDB did the trick, no more hangs, thank you @hamadmarri.

Also, not related to this issue but may I ask you, is there any straightforward tool to benchmark which of these tunable configs performs better?

ltsdw avatar Aug 17 '21 14:08 ltsdw

Hi @ltsdw ,

Good to hear it's working fine now, however, I really would like to troubleshoot why RDB causes these freezes.

Regarding tuning, there is no specific way to test. I tried to make the defaults work well in general, but if you have an issue you can change them. You need some background in CPU scheduling so that you can read about each cacule sysctl and change it accordingly.

I would like to keep this issue open until we see why RDB performs badly with Wine.

Thank you

hamadmarri avatar Aug 17 '21 16:08 hamadmarri

I suspect it is related to rcu calls and soft irq. I will post some fixes to try soon.

Thank you

hamadmarri avatar Aug 17 '21 16:08 hamadmarri

@hamadmarri, you might be onto something. This game does ~160k context switches, which might have something to do with it. But BMQ handles it, so it's doable. I'm looking forward to those fixes. Keep up the good work.

JohnyPeaN avatar Aug 17 '21 17:08 JohnyPeaN
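
A context-switch figure like the one above can be checked against the cumulative counter in the ctxt line of /proc/stat. The following is a minimal sketch, assuming a Linux system. It measures the system-wide rate rather than a single game's, and the function names are illustrative.

```python
import time


def parse_ctxt(stat_text: str) -> int:
    """Return the cumulative context-switch count from /proc/stat contents."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "ctxt":
            return int(fields[1])
    raise ValueError("no 'ctxt' line found")


def ctxt_rate(interval: float = 1.0, path: str = "/proc/stat") -> float:
    """Sample the counter twice and return context switches per second."""
    with open(path) as f:
        before = parse_ctxt(f.read())
    start = time.monotonic()
    time.sleep(interval)
    with open(path) as f:
        after = parse_ctxt(f.read())
    return (after - before) / (time.monotonic() - start)


# Usage: print(f"{ctxt_rate(5.0):.0f} context switches per second")
```

Per-process counts are available separately in /proc/<pid>/status (the voluntary_ctxt_switches and nonvoluntary_ctxt_switches fields), which is closer to what a single game contributes.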

Hi @JohnyPeaN , @ltsdw

To narrow down the troubleshooting, could you please try RDB with CONFIG_HZ_PERIODIC=y to see whether it is actually related to no_hz_{idle, full} balancing? I remember I had nohz_balancer_kick(rq); in RDB before, but I removed it from the RDB trigger_load_balance function for reasons I have since forgotten.

Also, can you try with:

PREEMPT_RCU=n
RCU_BOOST=n
CONFIG_RCU_FAST_NO_HZ=n
TASKS_RCU=n
TASKS_RCU_GENERIC=n

Or try vice versa: in case most of your RCU configs are disabled, try enabling them.

Based on an RDB code review I did two minutes ago, I suspect it is because of nohz balancing. I assume you are using no_hz_full?

Please let me know if any of the above changes fixes the freezes so I can propose a fix based on your feedback. If none of the above configs has any positive effect, then I can investigate something else.

Thank you

hamadmarri avatar Aug 18 '21 10:08 hamadmarri
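
Before rebuilding, it can help to confirm which of these options the running kernel was actually built with. Below is a minimal sketch that parses kernel config text, assuming the config is readable from /proc/config.gz (which requires CONFIG_IKCONFIG_PROC=y) or from a .config file. The option list and function names are illustrative. In the config file the symbols carry the CONFIG_ prefix, so PREEMPT_RCU appears as CONFIG_PREEMPT_RCU.

```python
import gzip

# Options discussed in this thread, written as they appear in a .config file.
OPTIONS = [
    "CONFIG_HZ_PERIODIC",
    "CONFIG_NO_HZ_IDLE",
    "CONFIG_NO_HZ_FULL",
    "CONFIG_PREEMPT_RCU",
    "CONFIG_RCU_BOOST",
    "CONFIG_RCU_FAST_NO_HZ",
    "CONFIG_TASKS_RCU",
    "CONFIG_CACULE_RDB",
]


def parse_config(text: str) -> dict:
    """Parse .config text into {symbol: value}; unset symbols map to 'n'."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(" is not set"):
            values[line[2:-len(" is not set")]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            key, _, value = line.partition("=")
            values[key] = value
    return values


def report(text: str, options=OPTIONS) -> dict:
    """Return the state of each option, or 'absent' if the symbol is not listed."""
    cfg = parse_config(text)
    return {opt: cfg.get(opt, "absent") for opt in options}


# Usage:
# with gzip.open("/proc/config.gz", "rt") as f:
#     print(report(f.read()))
```

An option reported as "absent" usually means its dependencies are not met in that build, which matches the observation later in the thread that some RCU options cannot be changed independently of PREEMPT.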

@hamadmarri I needed to set PREEMPT=n too, otherwise the RCU settings weren't applicable and a compile error occurred. I will test it a bit later.

JohnyPeaN avatar Aug 18 '21 11:08 JohnyPeaN

Hi @JohnyPeaN

I would advise you to first try with CONFIG_HZ_PERIODIC=y only, and if that has no effect, then try the RCU options.

Thank you

hamadmarri avatar Aug 18 '21 11:08 hamadmarri

@hamadmarri

ok, I'll try too, but I'll need some time, thank you!

ltsdw avatar Aug 18 '21 11:08 ltsdw

@hamadmarri

while compiling I noticed this:

kernel/sched/fair.c:11324:3: error: implicit declaration of function 'nohz_newidle_balance' [-Werror,-Wimplicit-function-declaration]
                nohz_newidle_balance(this_rq);
                ^
kernel/sched/fair.c:11324:3: note: did you mean 'nohz_run_idle_balance'?
kernel/sched/sched.h:2439:20: note: 'nohz_run_idle_balance' declared here
static inline void nohz_run_idle_balance(int cpu) { }
                   ^
1 error generated.
make[2]: *** [scripts/Makefile.build:273: kernel/sched/fair.o] Error 1
make[1]: *** [scripts/Makefile.build:516: kernel/sched] Error 2
make[1]: *** Waiting for unfinished jobs....

and the build failed.

ltsdw avatar Aug 18 '21 11:08 ltsdw

Nah, I think it was my fault, let me try again.

ltsdw avatar Aug 18 '21 11:08 ltsdw

Strange, kernel/sched/fair.c does in fact have a declaration of nohz_newidle_balance at line 11050. I don't know what could be wrong here. Why is it not visible when called at line 11324?

ltsdw avatar Aug 18 '21 12:08 ltsdw

needed to set PREEMPT=n too, otherwise the RCU settings weren't applicable and a compile error occurred. I will test it a bit later.

Oh, now I see it.

I would advise you to first try with CONFIG_HZ_PERIODIC=y only, and if that has no effect, then try the RCU options.

@hamadmarri

But now I'm confused: should I use PREEMPT=n or not? Apparently it can't be compiled without it!

ltsdw avatar Aug 18 '21 12:08 ltsdw

Hi @ltsdw

Please try first with CONFIG_HZ_PERIODIC=y only. Keep the rest as it was.

Thank you

hamadmarri avatar Aug 18 '21 12:08 hamadmarri

@hamadmarri

But now there is a compile error: kernel/sched/fair.c:11324:3: error: implicit declaration of function 'nohz_newidle_balance'

ltsdw avatar Aug 18 '21 12:08 ltsdw

Hi @ltsdw @hamadmarri

I think the compile error occurs because nohz_newidle_balance is not defined when CONFIG_NO_HZ_COMMON=n and CONFIG_CACULE_RDB=y. Please try the following fix:

--- a/kernel/sched/fair.c	2021-08-18 22:39:26.513174343 +1000
+++ b/kernel/sched/fair.c	2021-08-18 22:38:19.322803092 +1000
@@ -11084,9 +11084,9 @@
 {
 	return false;
 }
+#endif
 
 static inline void nohz_newidle_balance(struct rq *this_rq) { }
-#endif
 
 #endif /* CONFIG_NO_HZ_COMMON */
 

fix.patch.zip

raykzhao avatar Aug 18 '21 12:08 raykzhao

@hamadmarri @raykzhao

OK, I tested with CONFIG_HZ_PERIODIC=y and, at least for me, the hangs still happen. Now I'll try with:

PREEMPT_RCU=n
RCU_BOOST=n
CONFIG_RCU_FAST_NO_HZ=n
TASKS_RCU=n
TASKS_RCU_GENERIC=n

Just a question: should I keep CONFIG_HZ_PERIODIC=y or not?

ltsdw avatar Aug 18 '21 13:08 ltsdw

@hamadmarri CONFIG_HZ_PERIODIC=y removes the random lags and the game is smooth even with RDB. I also tried the other suggested config options, but nothing noticeable happened.

JohnyPeaN avatar Aug 18 '21 15:08 JohnyPeaN

@hamadmarri @JohnyPeaN

Just tested with:

PREEMPT_RCU=n
RCU_BOOST=n
CONFIG_RCU_FAST_NO_HZ=n
TASKS_RCU=n
TASKS_RCU_GENERIC=n

and that didn't work either; the hangs still happen for me.

ltsdw avatar Aug 18 '21 16:08 ltsdw

Hi @ltsdw

Another thing I would suspect is autogroup. Have you tried disabling it? You could try adding noautogroup to your kernel boot command-line parameters.

raykzhao avatar Aug 18 '21 16:08 raykzhao
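
To verify that the noautogroup boot parameter took effect, one option is to inspect /proc/cmdline and /proc/sys/kernel/sched_autogroup_enabled after rebooting. This is a minimal sketch, assuming a Linux kernel of this era built with CONFIG_SCHED_AUTOGROUP; the function names are illustrative.

```python
from pathlib import Path


def has_boot_param(cmdline: str, param: str = "noautogroup") -> bool:
    """True if `param` appears as its own word on the kernel command line."""
    return param in cmdline.split()


def autogroup_state(
    cmdline_path: str = "/proc/cmdline",
    sysctl_file: str = "/proc/sys/kernel/sched_autogroup_enabled",
) -> dict:
    """Report whether noautogroup was passed and whether autogroup is active."""
    cmdline = Path(cmdline_path).read_text()
    try:
        enabled = Path(sysctl_file).read_text().strip() == "1"
    except OSError:
        # Assumption: a missing file means autogroup support is not built in.
        enabled = None
    return {
        "noautogroup_on_cmdline": has_boot_param(cmdline),
        "autogroup_enabled": enabled,
    }


# Usage: print(autogroup_state())
```

If the sysctl file exists, writing 0 to it as root should disable autogroup at runtime, which avoids a reboot for a quick test.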