Resuming calculation after stopping

pfons · October 11, 2023, 11:27am

I am trying to compute the non-linear susceptibility for a system with 64 atoms. After some initial trouble, things seem to be going well, but as I expected, it is taking a long time to compute all of the derivatives. My calculation is now in dataset 4 (see below), and I can see that multiple density and wavefunction files have been (are being) written to disk. The last few are also listed below for reference. Unfortunately, I have to stop the machine for some maintenance tomorrow, and I am wondering if I can resume the calculation from dataset 4 and read in the already completed results and/or change the range of derivatives calculated to just do the missing pieces. This might be a tall order and not easily possible (but it is easy to ask!). I suspect that the calculation has taken a couple of weeks to get this far. It would be nice to throw more cores at the problem to speed it up, but when I tried 96 cores instead of the current 64, the program crashed for no obvious reason (likely too many cores for the algorithm). Any suggestions as to how to optimize things would be appreciated.

#####################
##### DATASET 4 #####
#####################
##############################################
####                SECTION: basic               
##############################################
 kptopt4 3
 tolvrs4 1e-12
##############################################
####                SECTION: dfpt                
##############################################
 optdriver4 1
 rfphon4 1                      # compute 1st-order WF derivatives with respect to atomic displacements
 rfelfd4 3                      # compute 1st-order WF derivatives with respect to electric field
 rfstrs4 3                      # compute 1st-order WF derivatives with respect to strains
 prepanl4 1                     # make sure that response functions are correctly prepared for a non-linear computation
##############################################
####                SECTION: files               
##############################################
 getwfk4 2                      # use GS WF from dataset 2
 getddk4 3                      # use ddk WF from dataset 3 (needed for electric field)
 prtwf4 1                       # save 1st-order WF on disk, will be used in other datasets
 prtden4 1                      # save 1st-order density on disk, will be used in other datasets

-rw-r--r--  1 10001  10002   5.0G Oct 11 17:02 outnc_DS4_1WF58.nc
-rw-r--r--  1 10001  10002    15M Oct 11 17:01 outnc_DS4_POT58.nc
-rw-r--r--  1 10001  10002    15M Oct 11 17:01 outnc_DS4_DEN58.nc
-rw-r--r--  1 10001  10002   5.0G Oct 11 14:23 outnc_DS4_1WF57.nc
-rw-r--r--  1 10001  10002    15M Oct 11 14:22 outnc_DS4_POT57.nc
-rw-r--r--  1 10001  10002    15M Oct 11 14:22 outnc_DS4_DEN57.nc
-rw-r--r--  1 10001  10002   5.0G Oct 11 11:15 outnc_DS4_1WF56.nc
-rw-r--r--  1 10001  10002    15M Oct 11 11:14 outnc_DS4_POT56.nc
-rw-r--r--  1 10001  10002    15M Oct 11 11:14 outnc_DS4_DEN56.nc
 .... and many more, for a total of 191 files so far

mverstra · March 18, 2024, 8:38am

Hi,

Each dataset is certainly independent. I would recommend running them as separate jobs, and once you know which perturbations are irreducible, you can even split up the datasets further.
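
Before splitting anything further, it can help to see which first-order wavefunction files are already on disk. The following is only a minimal sketch in Python: it assumes the file names follow the pattern visible in the listing above (`outnc_DS4_1WF58.nc`), and the function name `completed_1wf_indices` is made up for illustration. It only checks that a file exists; the most recent file may still be in the middle of being written.

```python
import re

# File-name pattern taken from the listing in the question,
# e.g. outnc_DS4_1WF58.nc (dataset 4, perturbation index 58).
_1WF_RE = re.compile(r"^outnc_DS(\d+)_1WF(\d+)\.nc$")


def completed_1wf_indices(filenames, dataset=4):
    """Return the sorted perturbation indices that have a 1WF file for `dataset`.

    Hypothetical helper: it only inspects file names, not file contents.
    """
    found = set()
    for name in filenames:
        match = _1WF_RE.match(name)
        if match and int(match.group(1)) == dataset:
            found.add(int(match.group(2)))
    return sorted(found)
```

Calling `completed_1wf_indices(os.listdir("."))` in the run directory (after `import os`) would list the indices found there.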

You can resume from dtset4 using jdtset 4 5 6 7…etc and reducing ndtset accordingly.
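
The key point is that ndtset must equal the number of entries given to jdtset. As a small illustration, the Python sketch below builds the two input lines for a contiguous range of datasets. The helper name `resume_dataset_lines` and the choice of 7 as the last dataset are assumptions for the example only; use whatever your input actually defines. It also presumes the files produced by datasets 2 and 3 are still on disk, since dataset 4 reads them via getwfk4 and getddk4.

```python
def resume_dataset_lines(first, last):
    """Build the ndtset/jdtset lines needed to run datasets first..last only.

    Hypothetical helper: `first` and `last` are dataset indices chosen by the user.
    """
    if first > last:
        raise ValueError("first must not exceed last")
    indices = list(range(first, last + 1))
    return [
        f"ndtset {len(indices)}",                      # number of datasets to run
        "jdtset " + " ".join(str(i) for i in indices),  # which datasets to run
    ]


# Example with an assumed last dataset of 7:
for line in resume_dataset_lines(4, 7):
    print(line)
```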

I can’t tell offhand why more cores are not working. I would need to see how many k-points you have; if nproc > nk, you need to distribute the bands evenly between all cores.
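
To make the "evenly" condition concrete, here is a rough divisibility check in Python. It is a heuristic sketch only, not a description of how the code distributes work internally: it assumes that each k-point should get the same number of cores, and that the bands should divide evenly among the cores sharing a k-point. The function name and the example numbers are made up.

```python
def even_band_splits(nkpt, nband, max_nproc):
    """List core counts up to max_nproc that split k-points and bands evenly.

    Heuristic (an assumption, not the program's own logic):
      - nproc <= nkpt: nproc must divide nkpt;
      - nproc >  nkpt: nproc must be a multiple of nkpt, and nband must be
        divisible by the number of cores per k-point.
    """
    good = []
    for nproc in range(1, max_nproc + 1):
        if nproc <= nkpt:
            if nkpt % nproc == 0:
                good.append(nproc)
        elif nproc % nkpt == 0 and nband % (nproc // nkpt) == 0:
            good.append(nproc)
    return good


# Example with made-up numbers: 4 k-points, 8 bands, up to 16 cores.
print(even_band_splits(4, 8, 16))
```

Running the same check with your own nkpt and nband would show whether 96 falls outside the list while 64 is inside it.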

M.