Evaluating support for Intel GPUs through OpenMP/offload

Dear ABINIT developers,

I’m interested in adding support for Intel GPUs through OpenMP offload, and I’d like some feedback/direction on how you think it is best to start. I’ve been skim-reading the code and I have some doubts that I’d appreciate your help in understanding/confirming:

  1. As of now, support for GPUs relies on OpenMP offload but also requires some low-level programming through CUDA and HIP (in addition to the respective math libraries, e.g. cuBLAS). If I look into shared/common/src/17_gpu_toolbox I see that dev_spec_[cuda|hip] contain some support functions from the lower-level libraries. My questions about this are:
    1.1) If the plan is to go with OpenMP, would it make sense to create some sort of dev_spec_omp_offload that unifies these, or at least the parts supporting OpenMP offload?
    1.2) There are some FFT- and linear-algebra-related files there, which I believe I could forward to oneMKL; do you think otherwise?
    1.2.1) Not sure if this is fully related: does this mean that some linear algebra or FFTs are currently not run on the GPU? I wonder if one could use MKL (or whatever math library) for all these computations and let the library decide whether to use the GPU underneath (a sketch of what I mean follows this list of sub-questions).
    1.3) There are the timing_[cuda|hip] files that seem to use GPU events to track device time, but they print some data to the screen (see calc_cuda_time_, for instance). Are these actually used? Does ABINIT have an internal profiler?
    1.4) How important is get_gpu_max_mem()? I’m not sure we have this feature right now, except possibly through OpenMP’s interop extensions.
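
  To make 1.2/1.2.1 more concrete, this is roughly the “let the library decide” pattern I have in mind with oneMKL’s OpenMP-offload interfaces. It is only a sketch and not ABINIT code; the include file and module name are my recollection of Intel’s examples and should be checked against the current oneMKL documentation:

    ! Sketch only: oneMKL's OpenMP-offload interfaces pick a GPU variant of dgemm
    ! for the call under the dispatch construct; the data stay on the device for
    ! the duration of the target data region.
    include "mkl_omp_offload.f90"

    program gemm_on_gpu_sketch
      use onemkl_blas_omp_offload_lp64   ! assumed module name, from Intel's examples
      implicit none
      integer, parameter :: n = 512
      double precision   :: a(n,n), b(n,n), c(n,n)

      call random_number(a); call random_number(b); c = 0.0d0

      !$omp target data map(to: a, b) map(tofrom: c)
      !$omp dispatch
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
      !$omp end target data

      print *, 'c(1,1) =', c(1,1)
    end program gemm_on_gpu_sketch

  My understanding is that the same call would simply run on the CPU when no device is present, which is part of the appeal.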

  2. Also, regarding CUDA-specific folders, there is src/46_manage_cuda with CUDA code in it. Do you think this needs to be ported for a first round of enabling?

  3. I’ve read in some source comments (sorry, I don’t recall which exact file) that the plan is to reach a point where the application uses managed (shared) memory. Is that right?
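
  For what it’s worth, on the OpenMP side I would expect that to look roughly like the following. This is just my assumption of the direction, not anything taken from ABINIT:

    ! Assumed direction only: with unified shared memory required, host allocations
    ! become directly accessible from the device, so most explicit map clauses can
    ! eventually be dropped.
    module usm_sketch
      !$omp requires unified_shared_memory
    contains
      subroutine scale_on_gpu(n, alpha, x)
        implicit none
        integer, intent(in)             :: n
        double precision, intent(in)    :: alpha
        double precision, intent(inout) :: x(n)
        integer :: i
        !$omp target teams distribute parallel do
        do i = 1, n
           x(i) = alpha * x(i)
        end do
        !$omp end target teams distribute parallel do
      end subroutine scale_on_gpu
    end module usm_sketch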

  4. Intertwined with 1): I’m currently using the autoconf mechanism to configure my package, and it seems to require and test against CUDA or HIP libraries. On Intel, I believe most of these math libraries reside in oneMKL, so there may be no need for further checks of GPU libs/capabilities. If so, could I “skip” some of the configuration in the autoconf scripts?

  5. Also related to configuration: do you suggest using autoconf/configure or CMake? My understanding is that CMake support for ABINIT is still in the works, right?

  6. Is there a test/validation suite that I can use for checking?

Since there are many topics here, and I may be missing many others, would you be open to having a call to discuss them?

Thank you so much in advance.

Dear Harald,

There are many questions in your post. I am not sure I will be able to answer all of them.

Before I begin, let me introduce myself: I am Marc Torrent from CEA (France), and I am responsible (along with 1 or 2 others) for the GPU porting of Abinit. This port is quite recent and is still evolving; therefore, we must be very careful when making changes. Most importantly, as you have just done, it is crucial to contact us before proceeding.

My first question is: what exactly is your objective? To contribute to the Abinit project, which is open source? To use Abinit on your clusters and have a private version? To promote Intel clusters?

Now, to answer your questions:

1/ There are several GPU versions coexisting in ABINIT: a “legacy” version that used CUDA directly, a “Kokkos” version, and an “OpenMP offload” version. The latter is the official GPU port. The other two are kept for historical reasons. The OpenMP version does not use CUDA code directly, except for card detection and capability measurement. There is also an equivalent for AMD cards and HIP/ROCm. Much of the low-level code is used by the older GPU ports. You need to follow the code by activating gpu_option=“GPU_OPENMP” in the Abinit input and see which routines are used.
In the code, the only hardware-dependent parts are the mathematical libraries (BLAS/LAPACK, FFT).
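For illustration only (this is not an excerpt from the ABINIT sources), the OpenMP-offload port is built on plain Fortran plus target directives of the following kind, which the NVIDIA, AMD and Intel compilers can each map to their own GPUs; only the calls to the math libraries differ between vendors:

    ! Illustration of the directive style, not ABINIT code: no CUDA/HIP kernels,
    ! just standard Fortran with OpenMP target directives.
    function dot_on_gpu(n, x, y) result(s)
      implicit none
      integer, intent(in)          :: n
      double precision, intent(in) :: x(n), y(n)
      double precision             :: s
      integer :: i
      s = 0.0d0
      !$omp target teams distribute parallel do reduction(+:s) map(to: x, y) map(tofrom: s)
      do i = 1, n
         s = s + x(i) * y(i)
      end do
      !$omp end target teams distribute parallel do
    end function dot_on_gpu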

1.1 I would prefer not to change anything for now and to add a routine for Intel. We will look at encapsulation later.
1.2 There are wrappers made for this purpose.
1.2.1 No. Some parts are not deployed because they are not profitable on GPU.
1.3 Our own profiling is done via nvtx or roctx. The other timing routines may be obsolete.
1.4 Very important! The distribution of tasks and memory is done according to this parameter. Memory is really the limiting factor for running Abinit on GPUs, and many routines adapt automatically based on it (see the sketch below).
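
As a rough sketch (this is not the exact ABINIT routine), on the CUDA side this boils down to a vendor query such as cudaMemGetInfo (hipMemGetInfo on AMD); an Intel port would need an equivalent, for example through Level Zero or an OpenMP interop extension:

    ! Sketch only -- not the actual ABINIT routine. The maximum device memory is
    ! obtained from a vendor query and then drives how work is distributed.
    subroutine get_gpu_max_mem_sketch(max_mem_bytes)
      use, intrinsic :: iso_c_binding, only : c_int, c_size_t
      implicit none
      integer(c_size_t), intent(out) :: max_mem_bytes
      integer(c_size_t) :: free_bytes, total_bytes
      integer(c_int)    :: ierr
      interface
         integer(c_int) function cudaMemGetInfo(free, total) bind(c, name="cudaMemGetInfo")
           use, intrinsic :: iso_c_binding, only : c_int, c_size_t
           integer(c_size_t), intent(out) :: free, total
         end function cudaMemGetInfo
      end interface
      ierr = cudaMemGetInfo(free_bytes, total_bytes)
      max_mem_bytes = total_bytes
    end subroutine get_gpu_max_mem_sketch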

2/ Most (if not all) of these routines are not used. These are the “legacy” or “kokkos” codes I mentioned.

3/ This is done and operational in the development version.

4/5/ You should prioritize the CMake build system. The autotools build system is maintained but is not the future of Abinit; however, the CMake build system is still experimental and incomplete. If you really want to use autotools, you will need to create new variables. We changed this recently; did you take the latest version, 10.4.3? There are variables such as with-cuda, CUDA_FCFLAGS, CUDA_LIBS, with-rocm, ROCM_FCFLAGS, ROCM_LIBS, etc. The same convention should be adopted for Intel.

6/ Yes, and this is mandatory. See tests/gpu_omp and tests/hpc_gpu_omp. At a minimum, all of these tests should pass.

I suggest we continue this discussion via private message, even though time is a bit limited at the moment.

Marc Torrent
marc.torrent@cea.fr