Use Adapt.jl to change storage and element type
In order to eventually support GPU computation we need
to use Adapt.jl to allow GPU backend packages to swap
out host-array types like CuArray with device-side types
like CuDeviceArray.
Additionally, this will allow us to change the element type
of a simulation by using `adapt(Array{Float32}, ...)`.
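A minimal sketch of this mechanism (the `Cache` container below is hypothetical, not Trixi's actual data layout): Adapt rebuilds a struct recursively, converting its array fields to the requested storage and element type.

```julia
# Hedged sketch of the Adapt.jl mechanism; `Cache` is a hypothetical
# container type, not Trixi's actual cache layout.
using Adapt

struct Cache{A<:AbstractArray}
    u::A
end
# Teach Adapt how to rebuild a Cache from its adapted fields:
Adapt.@adapt_structure Cache

c = Cache(rand(Float64, 4))

# Change the element type on the host:
c32 = adapt(Array{Float32}, c)
@show eltype(c32.u)   # Float32

# With a GPU backend loaded, the same call would swap the storage type,
# e.g. `adapt(CuArray, c)` producing device-side arrays (needs CUDA.jl).
```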
Review checklist
This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.
Purpose and scope
- [ ] The PR has a single goal that is clear from the PR title and/or description.
- [ ] All code changes represent a single set of modifications that logically belong together.
- [ ] No more than 500 lines of code are changed, or there is no obvious way to split the PR into multiple PRs.
Code quality
- [ ] The code can be understood easily.
- [ ] Newly introduced names for variables etc. are self-descriptive and consistent with existing naming conventions.
- [ ] There are no redundancies that can be removed by simple modularization/refactoring.
- [ ] There are no leftover debug statements or commented code sections.
- [ ] The code adheres to our conventions and style guide, and to the Julia guidelines.
Documentation
- [ ] New functions and types are documented with a docstring or top-level comment.
- [ ] Relevant publications are referenced in docstrings (see example for formatting).
- [ ] Inline comments are used to document longer or unusual code sections.
- [ ] Comments describe intent ("why?") and not just functionality ("what?").
- [ ] If the PR introduces a significant change or new feature, it is documented in `NEWS.md` with its PR number.
Testing
- [ ] The PR passes all tests.
- [ ] New or modified lines of code are covered by tests.
- [ ] New or modified tests run in less than 10 seconds.
Performance
- [ ] There are no type instabilities or memory allocations in performance-critical parts.
- [ ] If the PR intent is to improve performance, before/after time measurements are posted in the PR.
Verification
- [ ] The correctness of the code was verified using appropriate tests.
- [ ] If new equations/methods are added, a convergence test has been run and the results are posted in the PR.
Created with :heart: by the Trixi.jl community.
Codecov Report
Attention: Patch coverage is 68.44262% with 77 lines in your changes missing coverage. Please review.
Project coverage is 96.67%. Comparing base (43cd98d) to head (1acf007). Report is 1 commit behind head on main.
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #2212      +/-   ##
==========================================
- Coverage   96.83%   96.67%   -0.16%
==========================================
  Files         504      507       +3
  Lines       41816    42010     +194
==========================================
+ Hits        40490    40610     +120
- Misses       1326     1400      +74
```
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 96.67% <68.44%> (-0.16%) | :arrow_down: |
What's the status of this PR?
Okay, current status:
- Removed `AbstractHeterogenousContainer` and replaced it with a `storage_type` function that uses Adapt infrastructure for wrapped arrays.
- Started writing some docs on heterogeneous computing & Trixi.
- Added a minimal extension on CUDA.jl.
Open questions:
- [x] Should I untangle the whole VecOfArrays business into a separate PR? I am not all that happy with that implementation, but it is something we need to address for full CUDA support.
- [x] Can I add CUDA.jl as a test dependency, or should I pursue more the strategy in #2367 where I add it on-demand?
- [ ] Right now we are not adapting the mesh, similar to `remake`, is that okay?
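As a rough illustration of the `storage_type` idea mentioned above (this is an assumption about the approach, not the PR's actual implementation, which builds on Adapt's wrapped-array infrastructure): such a query can peel away array wrappers via `parent` until the underlying storage is reached.

```julia
# Hypothetical sketch of a `storage_type` query; the real implementation
# in the PR uses Adapt's wrapped-array machinery instead of `parent`.
function storage_type(x::AbstractArray)
    p = parent(x)
    # `parent` returns the array itself once no wrapper remains.
    p === x ? typeof(x) : storage_type(p)
end

storage_type(rand(4, 4))                    # Matrix{Float64}
storage_type(view(rand(4, 4), 1:2, :))      # Matrix{Float64} (unwraps the SubArray)
storage_type(transpose(rand(Float32, 3)))   # Vector{Float32} (unwraps the Transpose)
```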
> Started writing some docs on heterogeneous computing & Trixi
That's great news, excellent!
> Should I untangle the whole VecOfArrays business into a separate PR? I am not all that happy with that implementation, but it is something we need to address for full CUDA support.

If it's only with a reasonable overhead, I'd prefer to have a separate PR to make it easier to understand and review the code.
> Can I add CUDA.jl as a test dependency, or should I pursue more the strategy in add minimal CUDA pipeline #2367 where I add it on-demand?
Puh. Do you have a rough idea about how much overhead (downloading + precompilation) this will add to the non-CUDA CI runs? That is, are we talking about tens of seconds or rather minutes?
For the time being, my take would be to hold off adding it as a test dependency for now, since it will also impact people testing locally.
> Right now we are not adapting the mesh, similar to `remake`, is that okay?
It should be OK. The mesh in Trixi.jl is only used as an "infrastructure provider", i.e., it allows the solvers to figure out which cells exist, who their neighbors are, where their neighbors live (when running with MPI parallelism) etc. during the initialization phase. Thus in a usual simulation run there are no performance-critical loops working directly on mesh data.
One thought I had: It might make sense to create a meta issue for upcoming (and deferred) changes for GPU/heterogeneous computing support. That is, a single issue to keep track of the various PRs, other issues, and maybe open questions/TODOs that are not living in their own issue.
> Puh. Do you have a rough idea about how much overhead (downloading + precompilation) this will add to the non-CUDA CI runs? That is, are we talking about tens of seconds or rather minutes?
The most annoying thing is likely downloading the `CUDA_Runtime_jll`; other than that, we are talking seconds.
The biggest issue is that without it testing the storage_type=X change is almost impossible and would require significant mocking.
> If it's only with a reasonable overhead, I'd prefer to have a separate PR to make it easier to understand and review the code.
That's somewhat related to the issue above, it is a bit of code that doesn't impact anything until you are actually trying to move the model to the GPU.
I could only do the `real_type` part of this PR, but that excludes quite a bit of complexity.
> Thus in a usual simulation run there are no performance-critical loops working directly on mesh data.
Yeah, it's more a question if we are going to access data within it from the GPU, but I guess we don't know until we actually have a model running. As far as I can tell right now Lars didn't move it either.
> Puh. Do you have a rough idea about how much overhead (downloading + precompilation) this will add to the non-CUDA CI runs? That is, are we talking about tens of seconds or rather minutes?
>
> The most annoying thing is likely downloading the `CUDA_Runtime_jll`; other than that, we are talking seconds.
> The biggest issue is that without it, testing the `storage_type=X` change is almost impossible and would require significant mocking.
OK, then I'd say let's give it a try and see where it takes us.
> If it's only with a reasonable overhead, I'd prefer to have a separate PR to make it easier to understand and review the code.
That's somewhat related to the issue above, it is a bit of code that doesn't impact anything until you are actually trying to move the model to the GPU.
I could only do the
real_typepart of this PR , but that excludes quite a bit of complexity.
Hm, ok. I understand you're saying that we'd be able to split it, but then not really review the impact that it will have later on in the subsequent PRs? In that case, it might be best to keep it in one PR. In that case, please try to keep the PR as focused as possible (such that it does not blow up in size even more).
> Thus in a usual simulation run there are no performance-critical loops working directly on mesh data.
Yeah, it's more a question if we are going to access data within it from the GPU, but I guess we don't know until we actually have a model running. As far as I can tell right now Lars didn't move it either.
Right. No, there shouldn't be any access of mesh data from the GPU during the regular rhs! invocation. All possible mesh accesses would occur in one of the callbacks.
This looks like `test_trixi_include` is passing through an expression without resolving it in the parent scope.
```
/var/lib/buildkite-agent/builds/gpuci-4/julialang/trixi-dot-jl/examples/p4est_2d_dgsem/elixir_advection_basic.jl
elixir_advection_basic.jl (Float32): Error During Test at /var/lib/buildkite-agent/builds/gpuci-4/julialang/trixi-dot-jl/test/test_trixi.jl:186
  Got exception outside of a @test
  LoadError: UndefVarError: `CuArray` not defined
```
> This looks like `test_trixi_include` is passing through an expression without resolving it in the parent scope.
I have almost no idea what exactly this means, but we are actually working on `test_trixi_include`.
https://github.com/trixi-framework/TrixiTest.jl/pull/6
Yeah, the issue seems to be that the symbol `CuArray` is "copy-pasted" into the module instead of resolving it ahead of time to `Main.CUDA.CuArray`. So I would need to put `using CUDA` into the recipe, which is not ideal.
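The scoping difference can be illustrated with plain Julia metaprogramming (a stand-in type is used here so no CUDA is required): a bare symbol in an expression is looked up in the module that evaluates it, while an interpolated value is embedded directly into the AST.

```julia
# Sketch of the scoping issue; `MyArray` stands in for `CuArray` so
# this example runs without CUDA.jl installed.
struct MyArray end  # defined only in the current module

m = Module(:Sandbox)

# Bare symbol: `MyArray` must resolve inside `m`, where it is undefined.
ex_symbol = :(T = MyArray)
# Interpolated value: the type object itself is spliced into the AST,
# so no name lookup happens at evaluation time.
ex_value = :(T = $MyArray)

try
    Core.eval(m, ex_symbol)
catch err
    @show err   # UndefVarError, analogous to `CuArray` not defined
end

Core.eval(m, ex_value)   # succeeds: the value is already embedded
```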
Next steps:
- Symbol support for AMDGPU
- Noisy loading failure for AMDGPU https://github.com/JuliaGPU/AMDGPU.jl/issues/794
- `elixir_advection_basic.jl (Float32)`
- Invalidations?
- Downgrade?
- Aqua
```
Aqua.jl: Error During Test at /home/runner/work/Trixi.jl/Trixi.jl/test/test_aqua.jl:20
  Test threw exception
  Expression: isnothing(check_no_implicit_imports(Trixi, skip = (Core, Base, Trixi.P4est, Trixi.T8code, Trixi.EllipsisNotation)))
  ImplicitImportsException
  Module `Trixi` is relying on the following implicit imports:
  * `KernelAbstractions` which is exported by `KernelAbstractions`
  * `@index` which is exported by `KernelAbstractions`
  * `@kernel` which is exported by `KernelAbstractions`
  * `Backend` which is exported by `KernelAbstractions`
  * `get_backend` which is exported by `KernelAbstractions`
```
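Warnings like these are typically silenced by switching to explicit imports. A sketch of the fix (the KernelAbstractions name list is inferred from the error message above; the runnable part uses a standard library as a stand-in):

```julia
# For Trixi this would presumably become (assumption, based on the
# names listed in the error message):
#
#   using KernelAbstractions: KernelAbstractions, Backend, get_backend,
#                             @index, @kernel
#
# The same explicit-import pattern, shown runnable with a stdlib:
using LinearAlgebra: LinearAlgebra, dot, norm

@show dot([1, 2], [3, 4])   # every used name is now imported explicitly
@show norm([3.0, 4.0])
```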
Ok, I seem to have won the battle against the downgrade CI. From my perspective this is now done.
Code coverage is tricky since we now have multiple CI services. I can look at that, but I would prefer not holding this PR up for that.
@ranocha added a news entry as well.
In the future, should I squash the PR beforehand? I normally adjust the commit message before I squash-merge so that it is useful (instead of my messy development history).
We usually don't do that.