Abinit processes not terminating on job finish

Dear community,

We are running a Linux cluster (CentOS 7, SGE) here at ICAMS (https://www.icams.de), and some of our users run Abinit calculations on it.

Abinit is started using mpirun on one or more nodes.
The problem: after a scheduled job has finished, the abinit processes keep running, which causes trouble.
SGE considers the node free and submits new jobs to it, overloading the affected compute nodes.
We are kept quite busy finding and killing leftover abinit processes on the nodes.
Scripting a kill task is not trivial, since nodes with obsolete processes may also be running active Abinit calculations side by side.
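A minimal sketch of what such a kill task could look like, assuming that leftover abinit ranks get reparented to init (PID 1) once their sge_shepherd parent exits, while active jobs still have a shepherd as ancestor. The function name and filters are illustrative only, and on systemd machines orphans may be adopted by a session subreaper instead of PID 1, so the output needs a manual check before any kill:

```shell
# Illustrative orphan hunter; verify the output before killing anything.
# Assumption: when an SGE job ends, its sge_shepherd exits and leftover
# abinit processes are reparented to init (PID 1), while abinit processes
# of still-active jobs keep a non-1 parent PID.
list_orphaned_abinit() {
    ps -eo pid,ppid,comm --no-headers \
        | awk '$3 ~ /abinit/ && $2 == 1 {print $1}'
}

# Inspect first; only then, e.g.:  list_orphaned_abinit | xargs -r kill
list_orphaned_abinit
```

The PPID test is what lets the script leave active calculations alone on mixed nodes, since their processes still hang below a live shepherd.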

Does anyone have a hint on how to solve this problem?
Thank you!

Lothar (ICAMS IT team)

Dear Lothar,

I have never seen this behavior with Abinit. Typically, as soon as it crashes, all processes stop and the job finishes. And if it doesn’t crash but ends normally, then again it simply finishes.

How do your calculations end? Do they crash?

Olivier

Hi Olivier,

thank you for your reply!
Almost all of the other software products run on the cluster terminate as expected, even when they are killed by the scheduler for reaching the time limit. Only one other product sometimes shows the same behaviour as Abinit, but it is very rarely used.
The Abinit jobs usually have a maximum runtime of 6 days, but hardly any of them reaches that limit. Still, a lot of them leave their processes running afterwards, consuming CPU time. I’m not sure whether it is all of them.
I also assume they produce valid results, because I did not receive a single complaint from the users about crashing Abinit jobs.
It looks like the calculations finish normally, but the processes continue running.
Very strange.
At the moment I’m observing two of these jobs in detail and hope I can see more after the weekend.

Lothar

Perhaps related - on one cluster we saw runs complete fine, but then issue a segfault at exit time (I think when mpi_end was called). I am not sure this problem was related to abinit itself, rather to the MPI flavor and the specific compilation options used to build MPI. Have you tried several compilers and/or MPI libraries? That would be very helpful for debugging. Also, does the same happen for 1 node as well as for several nodes?

best

Matthieu

Hi Matthieu,
thank you for the hint!
We offer our software as modules on the cluster.
The Abinit module is compiled with OpenMPI 4.1.2 and gcc 11.2.0, so the module loads OpenMPI and gcc in those versions as prerequisites.
MPI and gcc are also used in conjunction with several other software products, and none of them shows any strange behaviour.
However, maybe it’s the combination with Abinit (version 9.10.3, by the way).
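Along those lines, one quick check is to inspect which MPI library the abinit binary is actually linked against. This is a generic sketch, assuming abinit is on the PATH after loading the module:

```shell
# Show the MPI shared library the abinit binary resolves at load time.
# Assumption: "abinit" is on the PATH after loading the module; adjust the
# path otherwise. The grep pattern matches Open MPI's libmpi.so.
ABINIT_BIN="$(command -v abinit || true)"
if [ -n "$ABINIT_BIN" ]; then
    ldd "$ABINIT_BIN" | grep -i 'libmpi'
else
    echo "abinit not found on PATH (load the module first)"
fi
```

If the resolved path points somewhere other than the OpenMPI 4.1.2 installation from the module, a runtime/compile-time MPI mismatch could explain odd behaviour at finalization.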

The job I mentioned above has finished during the weekend.
It ran on 3 nodes with a max runtime specified as 259200 seconds (3 days).
The job was terminated by the SGE because it exceeded the max runtime.
None of the 3 nodes involved shows any remains of the Abinit job.
The same goes for a second job, which was also terminated for exceeding its runtime.
It seems I need to find an example where the calculation finishes regularly and the mpi_end is called.
Unfortunately, this will take a while, because I could not find a job with less than 3 days of runtime specified.
I think I’ll talk to one of the users to run a couple of test jobs with datasets that produce very short runtimes and finish normally.

Lothar

P.S.:
The leftover abinit processes show up on different nodes all over the cluster.