
PPML Performance Unstable

Open qin-xiong opened this issue 1 year ago • 1 comments

I am testing GBDT and LR (https://github.com/intel-analytics/BigDL/tree/v2.3.0/ppml/trusted-big-data-ml/scala/docker-occlum/kubernetes) with Occlum and K8s on a cluster of 3 worker nodes (Intel 5418Y with SGX). When I repeat the same tests, they sometimes fail with heartbeat loss, and the time cost varies dramatically, from 148s to 200s (with 1 GB SGX_EXECUTOR_JVM_MEM_SIZE and 2 GB SGX_MEM_SIZE per core). If I reduce the number of cores so that SGX_EXECUTOR_JVM_MEM_SIZE rises to 6 GB and SGX_MEM_SIZE to 12 GB per core, no heartbeat loss has occurred so far, but the time cost of fit() still fluctuates between 125s and 240s. How should I adjust the environment and parameters to make the performance more stable?

qin-xiong avatar Oct 26 '23 08:10 qin-xiong

The heartbeat loss may be caused by SGX_MEM_SIZE being too low. I have never encountered the random fit() time issue; I think it may be caused by cached data. You can try the command echo 3 > /proc/sys/vm/drop_caches to drop the caches, then benchmark again. You could also test with the latest image, intelanalytics/bigdl-ppml-trusted-big-data-ml-scala-occlum:2.4.0-SNAPSHOT. If the issue still persists, please provide the detailed config and results so that I can reproduce it.
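To make the "drop caches and benchmark again" suggestion repeatable, a small timing harness along these lines might help. This is only a sketch and not part of BigDL or Occlum: the helper names (drop_page_cache, time_runs, summarize) are hypothetical, the job command is whatever you normally use to submit the test, and writing to /proc/sys/vm/drop_caches requires root on a Linux host.

```python
import statistics
import subprocess
import time


def drop_page_cache():
    """Equivalent of `echo 3 > /proc/sys/vm/drop_caches` (Linux, needs root)."""
    # Flush dirty pages first so that more of the cache can be dropped.
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")


def time_runs(cmd, repeats=5, before_each=None):
    """Run `cmd` `repeats` times and return the wall-clock durations in seconds.

    `before_each` is an optional callable (for example drop_page_cache)
    invoked before every run, outside the timed region.
    """
    durations = []
    for _ in range(repeats):
        if before_each is not None:
            before_each()
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return min, max, mean, sample stdev and coefficient of variation."""
    mean = statistics.mean(durations)
    stdev = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return {
        "min": min(durations),
        "max": max(durations),
        "mean": mean,
        "stdev": stdev,
        "cv": stdev / mean if mean else 0.0,
    }
```

For example, summarize(time_runs(job_cmd, repeats=5, before_each=drop_page_cache)), where job_cmd is the argument list of your own submit command, gives the spread with cold caches; running it again without before_each shows whether caching accounts for the variation. Attaching both summaries along with the config would make the issue easier to reproduce.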

hzjane avatar Oct 26 '23 09:10 hzjane