[Strace] gVisor takes dramatically longer with LightGBM training
Description
Hi team,
We have recently found an interesting issue with gVisor. Following is the Python script we ran:
import lightgbm as lgb
import sys
from numpy.random import seed
from numpy.random import randint

def lightgbm_method(num_jobs):
    count = 1000
    seed(1)
    data = []
    for _ in range(count):
        data.append(randint(0, 100, 5))
    labels = randint(0, 100, count)
    clf = lgb.LGBMClassifier(n_jobs=num_jobs)
    clf.fit(data, labels)
    return 0

lightgbm_method(int(sys.argv[1]))
For some reason, the runtime depends heavily on the num_jobs value passed to the function. The following tables show the runtime of this script for each num_jobs value.
Running on a node with 8 physical cores (c6gd.2xlarge):
| num_jobs | Native Kernel (seconds) | gVisor on systrap (seconds) | gVisor on ptrace (seconds) |
|---|---|---|---|
| 1 | 4.42 | 9.50 | 44.09 |
| 2 | 4.16 | 13.14 | 68.04 |
| 4 | 4.07 | 12.82 | 60.33 |
| 7 | 4.71 | 15.28 | 51.40 |
| 8 | 5.62 | 361.35 | 56.54 |
| 9 | 31.25 | 35.37 | 164.59 |
| 10 | 34.31 | 34.99 | 178.73 |
Running on a node with 16 physical cores (r7gd.4xlarge):
| num_jobs | Native Kernel (seconds) | gVisor on systrap (seconds) | gVisor on ptrace (seconds) |
|---|---|---|---|
| 1 | 3.59 | 8.42 | 33.54 |
| 2 | 3.49 | 11.84 | 51.37 |
| 4 | 3.34 | 11.60 | 48.88 |
| 8 | 4.66 | 13.58 | 49.61 |
| 15 | 26.38 | 189.51 | 170.27 |
| 16 | 75.84 | 272.24 | 220.72 |
| 17 | 76.74 | 67.99 | 248.44 |
The numbers above are very consistent in our environment.
Observations
- There is a clear pattern: the job takes significantly longer (up to 70 times longer!) to finish when `num_jobs` is set equal to the number of physical cores on the host.
- When `num_jobs` is not passed, `lgb.LGBMClassifier` defaults it to the number of physical cores, which makes the worst case the default case.
- However, when the `OMP_THREAD_LIMIT` environment variable is set to 1, the job completes quickly even when `num_jobs` equals the number of physical cores (see the sketch after this list).
- With the `ptrace` platform, runs take longer in general. However, when `num_jobs` is close to the number of physical cores, `ptrace` actually outperforms `systrap`, which might indicate an issue in `systrap`.
- There is a known LightGBM issue where multi-threading with OpenMP can hang. We followed the suggested workaround of setting `num_threads=1`, and the hang no longer occurs. But it is still not clear whether the performance degradation is caused by that issue, since we do not observe the same level of degradation on the native kernel.
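For reference, here is a minimal sketch showing the two workarounds in the context of our reproducer (setting the environment variable from inside the script is just one illustrative way to apply it; it can equally be set in the container environment):

```python
import os
# Cap OpenMP's thread pool before lightgbm/OpenMP is initialized.
os.environ["OMP_THREAD_LIMIT"] = "1"

import lightgbm as lgb
from numpy.random import seed, randint

seed(1)
data = [randint(0, 100, 5) for _ in range(1000)]
labels = randint(0, 100, 1000)

# Alternatively, pin the thread count explicitly instead of relying on the
# default, which equals the number of physical cores:
clf = lgb.LGBMClassifier(n_jobs=1)
clf.fit(data, labels)
```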
Could you please help us understand the degradation we are seeing here, especially the case where the host has 8 physical cores and num_jobs is also set to 8? Why would gVisor suddenly become ~70 times slower than the native kernel?
Steps to reproduce
Python script to reproduce:
import lightgbm as lgb
import sys
from numpy.random import seed
from numpy.random import randint

def lightgbm_method(num_jobs):
    count = 1000
    seed(1)
    data = []
    for _ in range(count):
        data.append(randint(0, 100, 5))
    labels = randint(0, 100, count)
    clf = lgb.LGBMClassifier(n_jobs=num_jobs)
    clf.fit(data, labels)
    return 0

lightgbm_method(int(sys.argv[1]))
runsc version
runsc version release-20241217.0-40-gfe855beceea5-dirty
spec: 1.1.0-rc.1
docker version (if using docker)
uname
Linux ws-uswest2-2-e20c 5.10.215-203.850.amzn2.aarch64 #1 SMP Tue Apr 23 20:32:21 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
kubectl (if using Kubernetes)
repo state (if built from source)
No response
runsc debug logs (if available)
Thanks a lot for the reproducer, I'll see if I can use it.
As to why this is happening: you can look at previous issues we've had on this, like https://github.com/google/gvisor/issues/9119. For communicating between the sentry and user processes we try to use the "fast" path as much as possible, which involves spinning; however, if we're core-bound, this also prevents other jobs from making progress, which leads to cascading performance losses across all jobs. We have an "intelligent" way of disabling the fast path, but I think we'll have to improve upon it.
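To make the trade-off concrete, here is a minimal, purely illustrative sketch of the spin-then-block pattern described above (this is not gVisor's actual implementation; `poll`, `block`, and the spin budget are made-up placeholders):

```python
import time

def wait_for_event(poll, block, spin_budget_ns=20_000):
    """Spin-then-block wait: `poll` is a cheap non-blocking check, `block`
    parks the thread (e.g. on a futex) until it is explicitly woken."""
    deadline = time.monotonic_ns() + spin_budget_ns
    while time.monotonic_ns() < deadline:
        if poll():
            return           # fast path: event arrived while busy-polling
    block()                  # slow path: park and give the core back to the scheduler
```

While a waiter is in the spinning phase it occupies a core, so on a fully loaded machine it competes directly with the application threads it is waiting on.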
Hi @konstantin-s-bogom, I tried completely disabling the fastpath in code and did not see much improvement from it.
however if we're core-bound this then also prevents other jobs from making progress
I am not 100% sure the node is overloaded on CPU. If you look at the table, when n_threads is set to 8 it took 361.35 seconds, while with n_threads set to 9 it only takes 35.37 seconds. n_threads=9 should introduce more CPU load than n_threads=8, if I understand correctly... For some reason, it takes significantly more time only when n_threads is set to the same value as the number of cores on the node.
I ran some tests with the fastpath disabled (for example, remove this line https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/platform/systrap/metrics.go;drc=abde965590349ca06d7cd526ec2d230c756f1697;l=326 to disable just the stub FP):
Normal:
time ./runsc --rootless --network none --ignore-cgroups do --force-overlay=false ./env/bin/python3 main.py 7 &>/dev/null
real 0m21.271s
user 2m20.941s
sys 0m6.019s
time ./runsc --rootless --network none --ignore-cgroups do --force-overlay=false ./env/bin/python3 main.py 8 &>/dev/null
real 0m43.996s
user 5m28.854s
sys 0m12.432s
time ./runsc --rootless --network none --ignore-cgroups do --force-overlay=false ./env/bin/python3 main.py 9 &>/dev/null
real 0m31.633s
user 1m50.803s
sys 0m34.676s
With sentry fastpath disabled:
time ./runsc-no-sentry-fastpath --rootless --network none --ignore-cgroups do --force-overlay=false ./env/bin/python3 main.py 7 &>/dev/null
real 0m21.685s
user 2m13.086s
sys 0m14.080s
time ./runsc-no-sentry-fastpath --rootless --network none --ignore-cgroups do --force-overlay=false ./env/bin/python3 main.py 8 &>/dev/null
real 0m18.566s
user 2m2.679s
sys 0m15.540s
time ./runsc-no-sentry-fastpath --rootless --network none --ignore-cgroups do --force-overlay=false ./env/bin/python3 main.py 9 &>/dev/null
real 0m47.948s
user 2m17.672s
sys 1m6.868s
With stub fastpath disabled:
time ./runsc-no-stub-fastpath --rootless --network none --ignore-cgroups do --force-overlay=false ./env/bin/python3 main.py 8 &>/dev/null
real 0m45.256s
user 5m35.654s
sys 0m12.375s
On my VM I wasn't able to reproduce quite the same magnitude of perf loss as you did, but based on these results I still think the issue has to do with fastpath enablement -- specifically the sentry fastpath, because this is a mostly computational workload.
Regarding your question about CPU overload -- in both cases there are only 8 active threads, because systrap doesn't create more user processes than there are CPUs https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/platform/systrap/systrap.go;drc=3a9ba17351571e343e16caec2e44215d8adc400f;l=250. What I think could be happening is that in the 8-thread case, there is not enough data to disable the sentry fast-path, so the sentry just ends up spinning and consuming 1 CPU, and 1 thread doesn't make progress and becomes the bottleneck. With 9 threads there is now enough data to tell the fastpath metric heuristics that the sentry fast-path is causing more harm than good. With 8 threads, there are 7 tasks generating "good" latency results and only 1 task generating "bad" ones; with 9 threads there would now be 2 tasks generating "bad" results, which tips the scales. Aside from freeing up the 1 core used by the sentry fast-path, it's possible the workload is spread out more evenly and no one thread is the bottleneck anymore? I am not quite sure about that last part...
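As a purely illustrative toy model of that "tipping the scales" argument (this is not the real logic in metrics.go; the function name, threshold, and latency numbers are made up), consider a heuristic that disables the fast path once a large enough fraction of tasks report degraded latency:

```python
def should_disable_sentry_fastpath(task_latencies_us, bad_threshold_us=50,
                                   bad_fraction=0.2):
    """Hypothetical heuristic: disable the sentry fast path once at least
    `bad_fraction` of tasks report latency above `bad_threshold_us`."""
    bad = sum(1 for lat in task_latencies_us if lat > bad_threshold_us)
    return bad / len(task_latencies_us) >= bad_fraction

# 8 jobs on 8 cores: only the single starved task looks "bad" (1/8 < 0.2),
# so the fast path stays enabled and keeps burning a core.
print(should_disable_sentry_fastpath([10] * 7 + [500]))         # False
# 9 jobs: two tasks now look "bad" (2/9 >= 0.2), which tips the scales.
print(should_disable_sentry_fastpath([10] * 7 + [500, 500]))    # True
```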
You can try modifying the heuristics in pkg/sentry/platform/systrap/metrics.go and see if you come up with something that works better. I won't be able to get to this for some time, but I am happy to review your code. Something that will help is using these debug metrics to know the fastpath state throughout the runtime https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/platform/systrap/metrics.go;drc=abde965590349ca06d7cd526ec2d230c756f1697;l=573. To use these, compile runsc with --go_tag=systrap_profiling and then specify the profiling metrics and log-file like this:
./runsc $OTHER_FLAGS \
--profiling-metrics=/systrap/numTimesSentryFastPathDisabled,/systrap/numTimesSentryFastPathEnabled,/systrap/numTimesStubFastPathDisabled,/systrap/numTimesStubFastPathEnabled,/systrap/numTimesStubKicked \
--profiling-metrics-log=./logs/metrics.log \
do $something
Thanks @konstantin-s-bogom for the explanation! That's certainly something I can look into from my end as well: optimizing the fastpath code to make it more "sensitive" to the metrics.
I have another quick question... From the table, when I set n_jobs to 7 it took only 17 seconds, while setting it to 8 took 300+ seconds. According to your data, even with the sentry fastpath disabled it would still give us 120+ seconds. Do you think optimizing the sentry fastpath (with a mix of fastpath and no-fastpath) could potentially give us even better performance than completely disabling it, or are there other places we could optimize to bring the performance in line with n_jobs=7? Thanks!
Do you think if optimizing sentry fastpath (with mix of fastpath and no-fastpath) would potentially give us even better performance than completely disabling it
For the sentry fastpath this is unlikely, because the sentry fastpath only uses one system thread to monitor for events from all stub threads, so it really is all-or-nothing. There may be something to explore here for the stub-thread fastpath, for example turning the fastpath off only for some of the threads... but this will be hard to optimize for the general case, and per our test results above it is not likely related to this particular issue.
or there could be other places we can optimize as well to make the performance align with when n_jobs is 7?
Just brainstorming, but maybe we could dynamically change maxSysmsgThreads to account for whether sentry fastpath is on or off? @avagin WDYT?
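Purely as an illustration of that idea (this is hypothetical pseudologic, not the actual change and not gVisor's Go code), the adjustment would amount to reserving a CPU for the spinning dispatcher while the sentry fast path is enabled:

```python
def max_sysmsg_threads(num_cpus, sentry_fastpath_enabled):
    """Hypothetical sketch: leave one CPU for the spinning sentry dispatcher
    when its fast path is on, so stub threads don't have to compete with it."""
    reserved = 1 if sentry_fastpath_enabled else 0
    return max(1, num_cpus - reserved)

print(max_sysmsg_threads(8, True))   # 7 stub threads + 1 CPU for the dispatcher
print(max_sysmsg_threads(8, False))  # all 8 CPUs available for stub threads
```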
maybe we could dynamically change maxSysmsgThreads to account for whether sentry fastpath is on or off?
Just so we don't repeat work, I tried this out and it didn't seem to help: DNS: Account for systrap dispatcher in maxSysmsgThreads. #11606