Error parallelizing DFPT calculation

Hi,

I am running a response-function calculation for BaTiO3 with MPI parallelization. With an energy cutoff of 35 Ha on 20 cores the calculation finishes successfully, but if I increase the number of cores or the energy cutoff, it crashes. The run stops at the phonon and electric-field perturbation step, and the last line in the log file is "-open ddk wf file :…".

It seems like a memory limitation, but I have 256 GB of memory per node. I also tried running on 4 nodes (1024 GB of memory in total) and it still crashed. Could it be a problem with the MPI processes not sharing memory correctly? Any help would be greatly appreciated.

Thank you.

Hi Apte.

Which version of ABINIT? Which platform? Please send us the full header of the log file so we can see the compilation options, and your input files so we know what you are doing.

If you change the ecut on the fly, it will have a hard time reading the old ddk files. I presume you are re-doing the whole thing from scratch: GS, then DDK, then phonons.
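For reference, a minimal three-dataset skeleton of that workflow (GS, then ddk, then phonons + E field at Gamma) looks roughly like the sketch below. The tolerances, ecut and rfatpol values are only illustrative, not taken from your input:

ndtset 3

# DS1: ground-state run
kptopt1   1
tolvrs1   1.0d-18

# DS2: d/dk perturbation (ddk wavefunctions), non-self-consistent
getwfk2   1
iscf2    -3
rfelfd2   2
rfdir2    1 1 1
tolwfr2   1.0d-22
kptopt2   2

# DS3: phonons at Gamma + homogeneous electric field
getwfk3   1
getddk3   2
rfphon3   1
rfatpol3  1 5        # all 5 atoms of the BaTiO3 cell (illustrative)
rfelfd3   3
rfdir3    1 1 1
tolvrs3   1.0d-8
kptopt3   2

nqpt    1
qpt     0 0 0
nqpt1   0
ecut    35

If ecut changes, everything from dataset 1 onwards has to be redone; otherwise dataset 3 will try to read ddk files that are incompatible with the new cutoff.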

Which perturbation is it doing when it crashes? Phonon displacement or E field? We had similar problems with a lot of I/O traffic around that point in the run for ddE perturbations. I improved the caching of the DDK wf reading, but that didn't change much. Do you have access to the node to check whether memory really is the problem? Are there no error messages?

Thanks for the quick reply!

I am using ABINIT 9.6.2. Here is the log file: GitHub - ASohm/BaTiO3 (sorry, I am unable to upload it as an attachment). Yes, I recalculate the ddk every time I change ecut. The run crashes for ddE perturbations; the phonon displacements work fine.

I used a system monitoring utility (remora) and got the following:
Run #1 - 1 node, 16 cores. Calculation completes successfully.
Max virtual memory = 301.2 GB
Max resident memory = 8.6 GB
Min free memory = 232.9 GB

Run #2 - 1 node, 32 cores. Calculation fails at ddE.
Max virtual memory = 1137.1 GB
Max resident memory = 15.6 GB
Min free memory = 226.6 GB

Update:

I tried using ABINIT 9.4.2, and it works correctly. Even with 128 cores and a large energy cutoff, the run finished successfully. I looked at the memory usage, and it was identical to 9.6.2. There seems to be a bug in the newer version, but I could not figure out what it is.
