EPH Task 11 netCDF error

Dear Abinit developers,

I am trying to obtain the e-ph matrix elements in the full k- and q-point BZs by setting eph_task = 11. When I run the calculation on a single core everything is fine, but when I run in parallel I get the following error:

--- !ERROR
src_file: m_gstore.F90
src_line: 2844
mpi_rank: 12
message: |
    No msg from caller - NetCDF library returned: `NetCDF: Can't open HDF5 attribute`
...

I am not sure whether this error is directly related to the number of cores, to be honest. I thought it might be because some of the CPUs do not get any q-points, but reducing the number of CPUs did not change anything.

I would appreciate any suggestions.

Best,
Fedor

Hello Fedor,

It looks like there is an MPI incompatibility between your builds of HDF5, NetCDF, and ABINIT. All three need to be built against the same libraries and the same MPI to get parallel MPI-IO access to netCDF (HDF5) format files. At the end of the configure step you should see a set of flags showing whether HDF5, NetCDF, and NetCDF-Fortran were all compiled with MPI support. You can also check the config.h file for HAVE_NETCDF_MPI etc…
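A quick way to inspect this in config.h (beyond HAVE_NETCDF_MPI, the exact macro names are just what I would expect to see, not guaranteed):

  # list the netcdf/hdf5 related defines and keep only the MPI-related ones
  grep -iE "netcdf|hdf5" config.h | grep -i mpi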

Hi Mverstra,

I also thought so, but in that case other netCDF calls should have crashed as well, right? Yet if I run e.g. a GS calculation, everything is written smoothly into OUT.nc…

Nevertheless, here is my config.h file: config.h.txt (21.1 KB)
config.ac file: config.ac.txt (1007 Bytes)
and the configuration output:

  * C compiler        : gnu version 11.3
  * Fortran compiler  : gnu version 11.3
  * architecture      : amd epyc (64 bits)
  * debugging         : basic
  * optimizations     : standard

  * OpenMP enabled    : no (collapse: ignored)
  * MPI    enabled    : yes (flavor: auto)
  * MPI    in-place   : no
  * MPI-IO enabled    : yes
  * GPU    enabled    : no (flavor: none)

  * LibXML2 enabled   : no
  * LibPSML enabled   : no
  * XMLF90  enabled   : no
  * HDF5 enabled      : yes (MPI support: yes)
  * NetCDF enabled    : yes (MPI support: yes)
  * NetCDF-F enabled  : yes (MPI support: yes)

  * FFT flavor        : fftw3 (libs: user-defined)
  * LINALG flavor     : openblas (libs: user-defined)
  * SCALAPACK enabled : no
  * ELPA enabled      : no
  * MAGMA enabled     : unknown (magma version >= 1.5 ? )

From this it seems that everything should be fine…

Indeed, the configure output looks fine. Perhaps at run time it finds another netCDF or HDF5 shared library?

ldd `which abinit`
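To narrow this down, something like the following shows which netCDF/HDF5 shared objects the binary actually picks up, and nc-config reports whether that netCDF build has parallel support (the exact nc-config options vary between versions, so --all is the safe choice):

  ldd `which abinit` | grep -iE "netcdf|hdf5"
  nc-config --all   # look for the parallel/pnetcdf entries in the output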

You can also try running in parallel interactively. In the end there may be a real bug (@gmatteo?), but the code has been used in parallel before.

For me, the list of dependencies looks fine as well:
ldd.txt (4.5 KB)
Before I run a calculation I always purge all loaded modules and load them again, to prevent mixing different library versions.

An interactive run just throws an MPI error when it tries to compute and store the phonon frequencies:

...
  === MPI distribution ===
P Number of CPUs for parallelism over perturbations: 1
P Number of perturbations treated by this CPU: 3
P Number of CPUs for parallelism over q-points: 1
P Number of CPUs for parallelism over k-points: 4
 Computing phonon frequencies and displacements in the IBZ
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 14.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

Hello,
I tried installing ABINIT 9.8.2 on my own laptop to see if anything changes. Again, according to config.h and the configure output everything is fine, and again I got an error at the same stage of the calculation when running in parallel, but the error itself was slightly different:

--- !ERROR
src_file: m_gstore.F90
src_line: 2841
mpi_rank: 1
message: |
    No msg from caller - NetCDF library returned: `NetCDF: Index exceeds dimension bound`
...

I am still not sure whether there is a problem at the compilation/runtime stage, though. I hope this last observation helps to locate the source of the issue.

Hello Fedor,

How do you distribute the parallel calculation? Can you try assigning a few processors (2, for example) only to perturbations in one case, to q-points in another, and to bands in a third?

Joao

Hello Joao,

Before, I had not explicitly specified the parallel distribution, assuming the code would handle it better on its own.
Now I tried the eph_np_pqbks option as you suggested, distributing only over perturbations, q-points or bands (see the example below). I still get the error at line 2841 on my laptop and at line 2844 on the cluster.
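For instance, for the perturbation-only test on 2 MPI processes I set something like this (as far as I understand, the five entries are the number of processes for perturbations, q-points, bands, k-points and spins, and their product should match the total number of MPI processes):

  eph_np_pqbks 2 1 1 1 1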

Best,
Fedor

Hi Fedor,

If you have a quick input file which prepares and runs your case (and crashes!), send it on.

To debug this further, could you add some print statements in m_gstore.F90?
The parallelization is splitting up the gstore netCDF writing in an erroneous way.

Something like:

! distribute the IBZ q-points over the ranks in gstore%comm
call xmpi_split_block(gstore%nqibz, gstore%comm, my_nqibz, my_iqibz_inds)
print *, "my_nqibz, my_iqibz_inds ", my_nqibz, my_iqibz_inds

ABI_MALLOC(buf_wqnu, (natom3, my_nqibz))
ABI_MALLOC(buf_eigvec_cart, (2, 3, natom, natom3, my_nqibz))

if (my_nqibz > 0) then
  iq_start = my_iqibz_inds(1)
  print *, "iq_start ", iq_start, iq_start + my_nqibz
  ! each rank writes its own slice of the phonon frequencies and eigenvectors
  ncerr = nf90_put_var(root_ncid, root_vid("phfreqs_ibz"), buf_wqnu, &
                       start=[1, iq_start], count=[natom3, my_nqibz])
  NCF_CHECK(ncerr)
  ncerr = nf90_put_var(root_ncid, root_vid("pheigvec_cart_ibz"), buf_eigvec_cart, &
                       start=[1, 1, 1, 1, iq_start], count=[2, 3, natom, natom3, my_nqibz])
  NCF_CHECK(ncerr)
end if

Hi Mverstra,

If you have a quick input file which prepares and runs your case (and crashes!), send it on.

I used a Python script to generate a fairly simple workflow, based on an example in the AbiPy documentation. Here it is:
al_flow.py.txt (2.6 KB)

For the EPH calculations, I just added three additional variables into flow_dir/w2/t0/run.abi:
gstore_cplex 1
gstore_kzone "ibz"
gstore_qzone "bz"
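So the EPH-related part of run.abi ends up looking like this (the grids, pseudos and file paths are whatever the AbiPy flow generated, so I omit them here):

  eph_task     11
  gstore_cplex 1
  gstore_kzone "ibz"
  gstore_qzone "bz"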

To debug this further, could you add some print statements in m_gstore.F90?

Done. For example, I ran the calculation on 4 CPUs: log.txt (45.4 KB)
I also played around with the printing to check the value of ncerr. To me it looks like something goes wrong when nf90_put_var(…) is invoked on the "slave" CPUs, because the program crashes before ncerr is printed; a sketch of what I added is below.
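Roughly, the extra prints around the first write look like this (nothing ABINIT-specific, just plain print statements on either side of the call):

  if (my_nqibz > 0) then
    iq_start = my_iqibz_inds(1)
    print *, "before nf90_put_var(phfreqs_ibz): iq_start, my_nqibz = ", iq_start, my_nqibz
    ncerr = nf90_put_var(root_ncid, root_vid("phfreqs_ibz"), buf_wqnu, &
                         start=[1, iq_start], count=[natom3, my_nqibz])
    ! on the crashing ranks this second print never appears in the log
    print *, "after nf90_put_var(phfreqs_ibz): ncerr = ", ncerr
    NCF_CHECK(ncerr)
  end if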

Best regards,
Fedor

That really sounds like an issue with a netCDF/HDF5 library that is not MPI-IO enabled. Maybe the compilation was fine, so config.log and abi.out claim everything is OK, but at execution time LD_LIBRARY_PATH actually picks up another libnetcdf.so which is not MPI enabled…

You can also use the Fortran flush(unit) statement to make sure all ranks are really up to date with their printing (a small sketch follows the quote below). And you can create a file called _LOG to get each process to save its own data:

…/src/44_abitypes_defs/m_dtfil.F90: ! if a _LOG file exists, a LOG file and a STATUS file are created for each cpu core
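A minimal sketch of the flush idea (unit 6 is just standard output, nothing ABINIT-specific):

  print *, "rank-local debug: my_nqibz, iq_start, ncerr = ", my_nqibz, iq_start, ncerr
  flush(6)   ! push the buffered output out before a possible MPI_ABORT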

@gmatteo any ideas?