ipex-llm
PPML Performance Unstable
I tested GBDT and LR (https://github.com/intel-analytics/BigDL/tree/v2.3.0/ppml/trusted-big-data-ml/scala/docker-occlum/kubernetes) with Occlum and K8s on a cluster of 3 worker nodes (Intel 5418Y with SGX). When I repeat the same tests, they sometimes fail with heartbeat loss, and the time cost varies dramatically, from 148s to 200s (1GB SGX_EXECUTOR_JVM_MEM_SIZE and 2GB SGX_MEM_SIZE per core). If I reduce the number of cores so that SGX_EXECUTOR_JVM_MEM_SIZE rises to 6GB and SGX_MEM_SIZE to 12GB per core, no heartbeat loss has occurred so far, but the time cost of fit() still fluctuates between 125s and 240s. How should I adjust the environment and parameters to make the performance more stable?
The heartbeat loss may be caused by SGX_MEM_SIZE being too low. I have never seen the random fit() time issue myself; it may be caused by cached data. You can try this command:
echo 3 > /proc/sys/vm/drop_caches
to drop the caches, then benchmark again. You could also test with the latest image, intelanalytics/bigdl-ppml-trusted-big-data-ml-scala-occlum:2.4.0-SNAPSHOT. If the issue still persists, please provide the detailed config and results so that I can reproduce it.
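To make the "drop caches and benchmark again" suggestion repeatable, here is a minimal Python sketch. It is not part of BigDL PPML. The helper names (drop_page_cache, benchmark, summarize) are hypothetical, and run_once stands for whatever you use to launch one job, which you would have to supply yourself. Writing to /proc/sys/vm/drop_caches works only on Linux and requires root.

```python
import os
import statistics
import time

# Linux-only kernel interface; writing to it requires root.
DROP_CACHES = "/proc/sys/vm/drop_caches"


def drop_page_cache(path=DROP_CACHES):
    """Flush dirty pages, then write 3 to drop page cache, dentries and inodes."""
    os.sync()
    with open(path, "w") as f:
        f.write("3\n")


def benchmark(run_once, repeats=5, before_each=drop_page_cache):
    """Time run_once() `repeats` times, calling before_each() ahead of every run."""
    times = []
    for _ in range(repeats):
        before_each()
        start = time.perf_counter()
        run_once()  # hypothetical: e.g. submit one GBDT job and wait for it
        times.append(time.perf_counter() - start)
    return times


def summarize(times_s):
    """Return min, max, mean and relative spread (max - min, as % of min)."""
    lo, hi = min(times_s), max(times_s)
    return {
        "min": lo,
        "max": hi,
        "mean": statistics.mean(times_s),
        "spread_pct": 100.0 * (hi - lo) / lo,
    }
```

Applied to just the extremes reported above, summarize([148, 200]) gives a spread of about 35% and summarize([125, 240]) gives 92%, so the larger-memory setup removed the heartbeat loss but not the variance. Posting the full list of per-run times, with and without the cache drop, would show whether caching explains the difference.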