Fix non bfb issues with latent heat in p3 on PM-GPU

Open tcclevenger opened this issue 1 year ago • 5 comments

This PR...

  • reverts 9438a980a0caebe562568beccb9b1c4b614a390f (changed P3 to use constants for latent_heat variables instead of allocating 2d views during runtime)
  • reimplements the main goal of those changes (no view allocations at runtime) but keeps these variables as views, which now live in the workspace manager (monolithic) or the memory buffer (small kernels).
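A rough illustration of the second bullet, written in Python rather than the Kokkos C++ that P3 uses: scratch arrays are carved out of a single buffer allocated at initialization, so the run-time step allocates nothing. The class `Workspace`, its `take` and `reset` methods, the helper `fill`, and the numeric value are hypothetical stand-ins for illustration; this is not the EAMxx workspace manager API.

```python
from array import array


class Workspace:
    """Hypothetical scratch pool: one allocation up front, views handed out later."""

    def __init__(self, n_slots: int, slot_len: int) -> None:
        # The only allocation happens here, at initialization time.
        self._buf = memoryview(array("d", [0.0] * (n_slots * slot_len)))
        self._n_slots = n_slots
        self._slot_len = slot_len
        self._next = 0

    def take(self) -> memoryview:
        """Return the next unused slot as a view into the shared buffer (no copy)."""
        if self._next >= self._n_slots:
            raise RuntimeError("workspace exhausted")
        start = self._next * self._slot_len
        self._next += 1
        return self._buf[start:start + self._slot_len]

    def reset(self) -> None:
        """Make every slot available again without freeing the buffer."""
        self._next = 0


def fill(view: memoryview, value: float) -> None:
    """Set every entry of a scratch view to the same value."""
    for k in range(len(view)):
        view[k] = value


ws = Workspace(n_slots=3, slot_len=4)   # three temporary views, as in the PR
latent_heat_fusion = ws.take()
fill(latent_heat_fusion, 3.337e5)       # illustrative value, not taken from P3
```

Because `take` returns a slice of the shared `memoryview`, writes through `latent_heat_fusion` land in the preallocated buffer, and `reset` lets the same storage be reused on the next step.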

I don't know why the previous version was non-BFB, but investigating may take some time and PM-GPU is needed for PR testing, so I suggest merging this PR. I've added "TODO" comments noting that these should eventually just be constants, and I can create an issue once this is merged (if others agree). The only downside is that we keep the 3 temp views.
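For readers unfamiliar with the term, BFB (bit-for-bit) means the results match in every bit, not just to within a tolerance. The sketch below is a generic illustration of that distinction and is not a diagnosis of the P3 difference, which this PR leaves unexplained. The helper names `bits` and `is_bfb` are made up for the example.

```python
import struct


def bits(x: float) -> bytes:
    """Return the raw IEEE-754 double representation of x."""
    return struct.pack("<d", x)


def is_bfb(a, b) -> bool:
    """True only if both sequences have identical bit patterns, element by element."""
    return len(a) == len(b) and all(bits(x) == bits(y) for x, y in zip(a, b))


# Same three numbers, different order of operations.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(is_bfb([left], [left]))     # identical inputs compare as BFB
print(is_bfb([left], [right]))    # reordered sum differs in the last bits
print(abs(left - right) < 1e-15)  # yet the two agree to within a tight tolerance
```

Floating-point addition is not associative, so any change that alters the order or form of the arithmetic a compiler emits can, in principle, produce answers that are numerically equivalent but not BFB.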

Testing

I ran the following:

./cime/scripts/create_test e3sm_scream_v1 e3sm_scream_v1_long --machine pm-gpu --compiler gnugpu -c -b master -t latent_heat_pr
./cime/scripts/create_test e3sm_scream_v1_medres --machine pm-cpu --compiler=gnu -c -b master -t latent_heat_pr

and passed all baselines for CPU and GPU.

tcclevenger Sep 16 '24 22:09

I've narrowed it down to the offending usage of latent_heat_fusion. I'm holding off on merging this in case the fix is simple and does not require updating baselines.

tcclevenger Sep 19 '24 15:09

Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects:

Pull Request Auto Testing STARTING

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: 6060
  • Status: STARTED

Jenkins Parameters

Parameter Name         Value
PR_LABELS              p3;bugfix
PULLREQUESTNUM         2998
SCREAM_SOURCE_REPO     https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA      fd86a15a71a670198f9410bd9fb62a7a67150255
SCREAM_TARGET_BRANCH   master
SCREAM_TARGET_REPO     https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA      25120ff1fd4fb086176d21ebf888d3722a915bb8
TEST_REPO_ALIAS        SCREAM

Using Repos:

Repo: SCREAM (E3SM-Project/scream)
  • Branch: tcclevenger/fix_non_bfb_issues_with_latent_heat_in_p3
  • SHA: fd86a15a71a670198f9410bd9fb62a7a67150255
  • Mode: TEST_REPO

Pull Request Author: tcclevenger

E3SM-Bot Sep 19 '24 19:09

Status Flag 'Pull Request AutoTester' - Jenkins Testing: 1 or more Jobs FAILED

Note: Testing will normally be attempted again in approx. 2 Hrs. If a change to the PR source branch occurs, the testing will be attempted again on next available autotester run.

Pull Request Auto Testing has FAILED

Build Information

Test Name: SCREAM_PullRequest_Autotester_Weaver

  • Build Num: 6060
  • Status: FAILED

Jenkins Parameters

Parameter Name         Value
PR_LABELS              p3;bugfix
PULLREQUESTNUM         2998
SCREAM_SOURCE_REPO     https://github.com/E3SM-Project/scream
SCREAM_SOURCE_SHA      fd86a15a71a670198f9410bd9fb62a7a67150255
SCREAM_TARGET_BRANCH   master
SCREAM_TARGET_REPO     https://github.com/E3SM-Project/scream
SCREAM_TARGET_SHA      25120ff1fd4fb086176d21ebf888d3722a915bb8
TEST_REPO_ALIAS        SCREAM

SCREAM_PullRequest_Autotester_Weaver # 6060 FAILED (last 100 lines of console output)

Warning: Permanently added the ECDSA host key for IP address '140.82.113.3' to the list of known hosts.
Submodule 'extern/Catch2' (git@github.com:E3SM-Project/Catch2) registered for path 'externals/ekat/extern/Catch2'
Submodule 'extern/kokkos' (git@github.com:E3SM-Project/kokkos) registered for path 'externals/ekat/extern/kokkos'
Submodule 'extern/spdlog' (git@github.com:gabime/spdlog.git) registered for path 'externals/ekat/extern/spdlog'
Submodule 'extern/yaml-cpp' (git@github.com:SNLComputation/yaml-cpp.git) registered for path 'externals/ekat/extern/yaml-cpp'
Cloning into '/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6060/scream/externals/ekat/extern/Catch2'...
Cloning into '/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6060/scream/externals/ekat/extern/kokkos'...
Cloning into '/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6060/scream/externals/ekat/extern/spdlog'...
Cloning into '/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6060/scream/externals/ekat/extern/yaml-cpp'...
Warning: Permanently added the ECDSA host key for IP address '140.82.114.4' to the list of known hosts.
ERROR: Repository not found.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'git@github.com:SNLComputation/yaml-cpp.git' into submodule path '/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6060/scream/externals/ekat/extern/yaml-cpp' failed
Failed to clone 'extern/yaml-cpp'. Retry scheduled
Cloning into '/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6060/scream/externals/ekat/extern/yaml-cpp'...
ERROR: Repository not found.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'git@github.com:SNLComputation/yaml-cpp.git' into submodule path '/home/e3sm-jenkins/weaver/workspace/SCREAM_PullRequest_Autotester_Weaver/6060/scream/externals/ekat/extern/yaml-cpp' failed
Failed to clone 'extern/yaml-cpp' a second time, aborting
Failed to recurse into submodule path 'externals/ekat'

at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:2846)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:2185)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.CliGitAPIImpl$7.lambda$execute$0(CliGitAPIImpl.java:1573)
at Jenkins v2.462.1//com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
at Jenkins v2.462.1//com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:76)
at Jenkins v2.462.1//com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at Jenkins v2.462.1//com.google.common.util.concurrent.DirectExecutorService.execute(DirectExecutorService.java:51)
at java.base/java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:184)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.cgit.GitCommandsExecutor.submitRemainingCommand(GitCommandsExecutor.java:77)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.cgit.GitCommandsExecutor.invokeAll(GitCommandsExecutor.java:70)

Also: hudson.remoting.Channel$CallSiteStackTrace: Remote call to weaver
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1826)
at hudson.remoting.UserRequest$ExceptionResponse.retrieve(UserRequest.java:356)
at hudson.remoting.Channel.call(Channel.java:1042)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.execute(RemoteGitImpl.java:153)
at jdk.internal.reflect.GeneratedMethodAccessor105.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:569)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.invoke(RemoteGitImpl.java:138)
at PluginClassLoader for git-client/jdk.proxy30/jdk.proxy30.$Proxy100.execute(Unknown Source)
at PluginClassLoader for git//hudson.plugins.git.extensions.impl.SubmoduleOption.onCheckoutCompleted(SubmoduleOption.java:196)
at PluginClassLoader for git//hudson.plugins.git.GitSCM.checkout(GitSCM.java:1388)
at hudson.scm.SCM.checkout(SCM.java:540)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1247)
at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:649)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:85)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:521)
at hudson.model.Run.execute(Run.java:1894)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
at hudson.model.ResourceController.execute(ResourceController.java:101)
at hudson.model.Executor.run(Executor.java:446)
Caused: hudson.plugins.git.GitException
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.cgit.GitCommandsExecutor.checkResult(GitCommandsExecutor.java:89)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.cgit.GitCommandsExecutor.invokeAll(GitCommandsExecutor.java:69)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.cgit.GitCommandsExecutor.invokeAll(GitCommandsExecutor.java:47)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.CliGitAPIImpl$7.execute(CliGitAPIImpl.java:1576)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$GitCommandMasterToSlaveCallable.call(RemoteGitImpl.java:170)
at PluginClassLoader for git-client//org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$GitCommandMasterToSlaveCallable.call(RemoteGitImpl.java:161)
at hudson.remoting.UserRequest.perform(UserRequest.java:211)
at hudson.remoting.UserRequest.perform(UserRequest.java:54)
at hudson.remoting.Request$2.run(Request.java:377)
at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused: java.io.IOException: Could not perform submodule update
at PluginClassLoader for git//hudson.plugins.git.extensions.impl.SubmoduleOption.onCheckoutCompleted(SubmoduleOption.java:201)
at PluginClassLoader for git//hudson.plugins.git.GitSCM.checkout(GitSCM.java:1388)
at hudson.scm.SCM.checkout(SCM.java:540)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1247)
at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:649)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:85)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:521)
at hudson.model.Run.execute(Run.java:1894)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
at hudson.model.ResourceController.execute(ResourceController.java:101)
at hudson.model.Executor.run(Executor.java:446)
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash -le

cd $WORKSPACE/${BUILD_ID}/

./scream/components/eamxx/scripts/jenkins/jenkins_cleanup.sh
[SCREAM_PullRequest_Autotester_Weaver] $ /bin/bash -le /tmp/jenkins8739115007280394007.sh
POST BUILD TASK : SUCCESS
END OF POST BUILD TASK : 0
Sending e-mails to: [email protected]
Finished: FAILURE

E3SM-Bot Sep 19 '24 19:09

@tcclevenger did you add WIP to this PR because the AT wasn't working? If so, I think we can unWIP it and add the RETEST label.

AaronDonahue Sep 24 '24 20:09

No, I marked this WIP because we don't want to merge it; I was just using it to track the issue. I could close it, but I didn't in case it turns out we actually need it. I don't think that will be the case, though.

tcclevenger Sep 24 '24 20:09

Closing this since it keeps showing up as a request for review for me.

mahf708 May 06 '25 14:05

@tcclevenger is this still an issue? Should we re-open this on the E3SM side?

AaronDonahue May 06 '25 16:05

No, this is outdated. Ok to close.

tcclevenger May 06 '25 22:05