
unstable performance of gnu-dav/omp-intel-dav/gnu-elpa/omp-intel-elpa

Open pxlxingliang opened this issue 1 year ago • 3 comments

Details

Since 2024-01-12, the performance of the gnu-dav/omp-intel-dav/gnu-elpa/omp-intel-elpa builds has been unstable.

Details can be found at: https://labs.dp.tech/projects/abacustest/?request=GET%3A%2Fapplications%2Fabacustest%2Fjobs%2Fsched-abacustest-summary-daily-09e212

Task list for Issue attackers (only for developers)

  • [ ] Reproduce the performance issue on a similar system or environment.
  • [ ] Identify the specific section of the code causing the performance issue.
  • [ ] Investigate the issue and determine the root cause.
  • [ ] Research best practices and potential solutions for the identified performance issue.
  • [ ] Implement the chosen solution to address the performance issue.
  • [ ] Test the implemented solution to ensure it improves performance without introducing new issues.
  • [ ] Optimize the solution if necessary, considering trade-offs between performance and other factors (e.g., code complexity, readability, maintainability).
  • [ ] Review and incorporate any relevant feedback from users or developers.
  • [ ] Merge the improved solution into the main codebase and notify the issue reporter.

pxlxingliang avatar Jan 17 '24 09:01 pxlxingliang

@pxlxingliang Could you please help locate the exact test cases and the commits that might be causing the performance gap shown in the performance report? I cannot find the earlier test cases and their results. Thanks.

caic99 avatar Jan 28 '24 08:01 caic99

I have checked the latest results. A clear pattern is that only gnu-dav/omp-intel-dav/gnu-elpa/omp-intel-elpa show large performance variation.

I have compared the results from 2024-01-28 and 2024-01-29. For all examples, the total time on 01-28 is larger than that on 01-29, even though both days ran the same version, v3.5.1 (8239efb, Sat Jan 27 10:42:35 2024 +0800), on the same machine type, Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz.

For each example, the total time on 01-29 is about 2/3 of that on 01-28.
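
For reference, here is a minimal sketch of how the per-example ratio between the two days could be computed. This is not the actual abacustest tooling; the example names and timings below are hypothetical placeholders.

```python
# Hypothetical total wall times (in seconds) per test example on the two days.
totals_0128 = {"example_A": 312.0, "example_B": 540.0}  # 2024-01-28
totals_0129 = {"example_A": 208.0, "example_B": 361.0}  # 2024-01-29

for name in sorted(totals_0128):
    ratio = totals_0129[name] / totals_0128[name]
    # A ratio near 0.67 corresponds to the "about 2/3" observation.
    print(f"{name}: t(01-29)/t(01-28) = {ratio:.2f}")
```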

pxlxingliang avatar Jan 29 '24 08:01 pxlxingliang

After following up with the Bohrium team, this may be related to the Alibaba Cloud machines: they found that the jobs run on 01-28 were dispatched to Beijing, while the jobs run on 01-29 were dispatched to Zhangjiakou. They have now pinned the dispatch to Zhangjiakou for the ABACUS daily-testing images (intel/gnu/cuda:latest). We can track the performance over the following days. @mohanchen @caic99

pxlxingliang avatar Jan 29 '24 09:01 pxlxingliang

Hi @pxlxingliang, we've discussed this issue with the Bohrium team, and the testing instance is now pinned. Should we close this issue now?

caic99 avatar Jun 18 '24 03:06 caic99

I have added monitoring of machine performance via mixbench to the daily test. Closing this issue.
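
A minimal sketch of such a probe, assuming a `mixbench-cpu` binary (built from https://github.com/ekondis/mixbench) is available on PATH inside the test image; the helper name and log layout are illustrative, not the actual daily-test code:

```python
import datetime
import socket
import subprocess
from pathlib import Path

def record_machine_baseline(log_dir: str = "daily-test-logs") -> Path:
    """Run mixbench once and keep its raw output with a timestamp and hostname,
    so ABACUS timings can later be correlated with the machine's baseline."""
    out_dir = Path(log_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.utcnow().strftime("%Y%m%d")
    log_file = out_dir / f"mixbench-{stamp}.log"
    # Capture the raw benchmark output; parsing/thresholding can be added later.
    result = subprocess.run(["mixbench-cpu"], capture_output=True, text=True)
    log_file.write_text(
        f"date: {stamp}\nhost: {socket.gethostname()}\n\n{result.stdout}{result.stderr}"
    )
    return log_file
```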

pxlxingliang avatar Jun 18 '24 04:06 pxlxingliang