Severe inefficiencies in parallel computing

iavas · June 27, 2024, 3:09am

Hello all,

I’m currently trying to run some calculations in parallel. In a previous post, I found the computation to be suspiciously slow, so I ran case #2 of the tutorial as a test.

According to the tutorial, the calculation for the second input file, which contains 5 SCF steps, should complete in about a minute using 64 processes in parallel. However, when I ran it, a single SCF step took more than 2 minutes. Something must be very wrong, but I have no clue how to locate the problem. The log, input, and output files are below (I manually killed the job after SCF step 3, but the timing information is still in the log file):
tparal_bandpw_02.log (78.6 KB)
tparal_bandpw_02.abi (10.2 KB)
tparal_bandpw_02.abo (42.2 KB)

How do I fix this problem? Or is there any other information I can provide to help pinpoint the problem? Thank you very much.

beuken · June 27, 2024, 11:08am

Hi,
just in case…

export OPENBLAS_NUM_THREADS=1
export GOTO_NUM_THREADS=1
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
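
A minimal sketch of how the settings above might be placed in a job script, assuming a bash shell. The four variable names are the ones listed above; the loop, the `env | grep` check, and the commented launch line are illustrative assumptions, not taken from the thread.

```shell
#!/usr/bin/env bash
# Pin each threaded math library to one thread per MPI process,
# so that every MPI rank does not start its own pool of threads.
for var in OPENBLAS_NUM_THREADS GOTO_NUM_THREADS OMP_NUM_THREADS MKL_NUM_THREADS; do
    export "$var=1"
done

# Print the values so they are recorded in the job log.
env | grep -E '^(OPENBLAS|GOTO|OMP|MKL)_NUM_THREADS=' | sort

# Launch line (assumption; adapt to the local scheduler and MPI launcher):
# mpirun -n 64 abinit tparal_bandpw_02.abi > tparal_bandpw_02.log
```

When a job spans several nodes, some MPI launchers may not pass exported variables on to the remote nodes automatically, so it can be worth checking the launcher's documentation for how environment variables are forwarded.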

iavas · June 27, 2024, 2:07pm

I think this might be where the problem is: when I used a single node, the job finished pretty fast. If one “node” in my system has fewer than 64 CPUs, so that I have to use two nodes for this job, but my ABINIT executable was compiled without OpenMP support, would that cause a problem like this?

beuken · June 27, 2024, 5:03pm

Regarding the netlib library, did you compile from source or install a netlib package?
LINALG libraries (OpenBLAS, MKL, etc.) are generally compiled with OpenMP (OMP) support.

iavas · June 30, 2024, 4:25pm

Hi,
The cluster admin compiled this software for me, and then, after I asked for help, compiled what is supposed to be an OpenMP-enabled version. I used it to try to repeat the Hybrid Parallelism part of the tutorial, testing three sets of parameters:
tparal_bandpw_04_mpi48_thread1.abi (10.2 KB)
tparal_bandpw_04_mpi64_thread1.abi (10.2 KB)
tparal_bandpw_04_mpi32_thread2.abi (10.2 KB)
tparal_bandpw_04_mpi48_thread1.abo (192.3 KB)
tparal_bandpw_04_mpi64_thread1.abo (193.3 KB)
tparal_bandpw_04_mpi32_thread2.abo (192.8 KB)

Our cluster has only 56 CPUs per node. Only the 48-process, 1-thread case (OMP_NUM_THREADS=1), which fits on a single node (48 < 56), gives a Proc time of about 200 seconds. For the 64-process, 1-thread and 32-process, 2-thread cases, the Proc time is much longer.
In all three cases, the overall time is still much longer than the result given in the tutorial.
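
The node arithmetic behind these three runs can be sketched as follows, assuming bash and the 56 cores per node mentioned above. The function name `nodes_needed` and the variable `CORES_PER_NODE` are hypothetical helpers introduced only for this illustration.

```shell
#!/usr/bin/env bash
# Cores per node, as reported for this cluster.
CORES_PER_NODE=56

# nodes_needed RANKS THREADS: smallest number of nodes that can hold
# RANKS MPI processes with THREADS OpenMP threads each (one core per thread).
nodes_needed() {
    local cores=$(( $1 * $2 ))
    echo $(( (cores + CORES_PER_NODE - 1) / CORES_PER_NODE ))
}

# The three configurations tested: MPI processes, threads per process.
while read -r ranks threads; do
    echo "mpi=$ranks threads=$threads cores=$(( ranks * threads )) nodes=$(nodes_needed "$ranks" "$threads")"
done <<'EOF'
48 1
64 1
32 2
EOF
```

Under these assumptions, only the 48-process run fits on one node; both 64-core configurations need a second node. That matches the pattern in the timings, although it does not by itself show that inter-node communication is the cause.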