Thomas Baumann

Results: 58 comments by Thomas Baumann

Regarding the slow `Alltoallw`, I used OpenMPI and ParaStationMPI (MPICH) modules as installed on the Jülich supercomputers. I trust they installed it well. Maybe the data is copied to host...

> @brownbaerchen would this https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuFFTMp/JAX_FFT help running cuFFTMp in Python? Thanks for mentioning this! Since I haven't done any work with pybind11 or JAX before, this is a bit opaque...

> Sorry for long delay. @brownbaerchen Your snippet above caught my attention. You shouldn't need the lines calling `fftw.aligned()` and then converting to CuPy arrays. It's a waste of memory...
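The suggested change can be sketched as follows: allocate on the device directly instead of building an aligned host buffer and then converting it to a CuPy array. Names and shapes here are illustrative; `xp` falls back to NumPy so the sketch runs without a GPU.

```python
# Instead of allocating an aligned host buffer and copying it over:
#   u_host = fftw.aligned(shape, dtype='complex64')
#   u = cupy.asarray(u_host)   # extra host allocation + host-to-device copy
# allocate on the device directly:
try:
    import cupy as xp  # use the GPU when CuPy is available
except ImportError:
    import numpy as xp  # NumPy fallback so the sketch runs anywhere

shape = (128, 128)  # illustrative problem size
u = xp.zeros(shape, dtype='complex64')  # direct allocation, no host staging
```

The alignment that `fftw.aligned` provides matters for FFTW's SIMD code paths on the host; a buffer that only ever lives on the GPU does not need it.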

@leofang, yes, both MPI implementations are built with CUDA support. I talked to the support here in Jülich and they also don't know why this is happening, but assured me...

> @brownbaerchen It would be nice to gather more info. For Open MPI, please share the output of `ompi_info`. For MPICH, please share the output of `mpichversion`. Also, in both...
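For reference, the requested checks can be run like this (exact output keys vary between versions and installs):

```shell
# Open MPI: dump the full configuration report; this grep is the usual way
# to check whether the build has CUDA support compiled in
ompi_info --parsable --all | grep mpi_built_with_cuda_support

# MPICH (ParaStationMPI is MPICH-based): prints version and configure options
mpichversion
```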

After @mhrywniak gave me some tips on how to use Nsight, I could finally see where the expensive memory operations occur that I have been talking about. To recap: The...
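In case it helps others reproduce this, a typical Nsight Systems invocation for spotting such memory operations looks like the following (the script name is a placeholder):

```shell
# Trace CUDA memcpys/kernels plus NVTX ranges; `mpi` adds MPI calls so
# transfers show up interleaved with communication on the timeline.
nsys profile --trace=cuda,nvtx,mpi -o report python my_script.py

# Open the report in the Nsight Systems GUI, or summarize it on the CLI:
nsys stats report.nsys-rep
```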

Good point @dalcinl! Indeed, multiplying a real view in-place results in a kernel called `cupy_multiply__float32_float_float32`. Unfortunately, it is not any faster... I noticed something else, though, but I am not...
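The in-place update on the real view can be sketched like this (shown with NumPy, whose view semantics CuPy mirrors; on CuPy this is what launches the `cupy_multiply__float32_float_float32` kernel mentioned above):

```python
import numpy as np  # CuPy exposes the same API via `import cupy`

u = np.array([1 + 2j, 3 + 4j], dtype=np.complex64)

# `u.real` is a strided view into the complex buffer, so this scales the
# real parts in place without allocating a temporary complex array
u.real *= 2.0

print(u)  # [2.+2.j 6.+4.j]
```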

> Huh, interesting... you're right, the transfer operations are not as generic as I thought. For `mesh_to_mesh` I understand this, but are you sure about `mesh_to_mesh_fft`? No, I think `mesh_to_mesh_fft`...

It seems we actually implemented the generic Dirichlet and Neumann boundary conditions that we need back at the TIME-X hackathon in Darmstadt. The only thing left to solve this issue is...