Platform: Linux
Applies to: COMSOL Multiphysics®, COMSOL Server™
Versions: 6.1, 6.0, 5.6

Problem Description

I am observing that distributed cluster jobs on Linux are not starting up. I am receiving error messages from MPI.

Solution

The underlying reason for COMSOL not working on a Linux cluster might be that the network interface and fabrics are not detected correctly. On Linux, COMSOL 6.1 is shipped with Intel MPI 2021.6 and COMSOL 6.0 with Intel MPI 2021.2. You can investigate if there is an incompatibility with Intel MPI using the following steps:

If Intel MPI does not appear to work on your cluster, first make sure that your submission script is configured correctly. Then run the MPI test by calling

comsol hydra mpitest -nn 2 -f hostfile

or, e.g. with Slurm,

#SBATCH --nodes=2  
#SBATCH --ntasks-per-node=1 
...
comsol hydra mpitest -nn 2 -nnhost 1 

to confirm that MPI is actually the issue. You can add the switch '-mpidebug 10' to get additional debug output.
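As a minimal sketch, the Slurm variant of the MPI test with the debug switch could be wrapped in a batch script like the one below. The log file name and the guard around the comsol call are illustrative assumptions, as is placing '-mpidebug 10' after the other switches; adapt them to your cluster.

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

# MPI test command from this article, with additional debug output.
MPITEST_CMD="comsol hydra mpitest -nn 2 -nnhost 1 -mpidebug 10"

if command -v comsol >/dev/null 2>&1; then
    # Save the output to a log file (hypothetical name) so it can be inspected later.
    $MPITEST_CMD 2>&1 | tee "mpitest_${SLURM_JOB_ID:-manual}.log"
else
    # Without comsol in PATH, only report what would have been run.
    echo "comsol not found in PATH; would run: $MPITEST_CMD"
fi
```

If the test fails here but the submission script is otherwise correct, the debug output in the log is a reasonable starting point before trying the suggestions below.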

To resolve the problem, try suggestions A and B. Even if A works for you, you should still try B, as this option would offer better performance.

A. Fall Back to TCP

Export the environment variable FI_PROVIDER and set it to 'sockets'. With Slurm, this can be done by means of

#SBATCH --export=FI_PROVIDER=sockets

Otherwise, you can use

export FI_PROVIDER=sockets 

or

setenv FI_PROVIDER sockets 

and make sure that this environment variable is handed over to your cluster job.
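One way to check that the variable is actually handed over is to print it from a child process, since that is how a launched job sees the environment. The following is a small sketch for a bash-based job script; the check itself is an illustration and not part of COMSOL.

```shell
#!/bin/bash
# Force the TCP fallback described above.
export FI_PROVIDER=sockets

# A child shell only sees exported variables, so this confirms the export worked.
bash -c 'echo "FI_PROVIDER=${FI_PROVIDER:-<unset>}"'
```

Under Slurm, running the same echo through srun inside the job should show whether the compute nodes see the variable, assuming srun is available in your job environment.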

If you are running cluster jobs from the COMSOL Desktop, add --export=FI_PROVIDER=sockets to the Additional scheduler arguments field. If you are using Slurm, also add the FLROOT environment variable, using a comma as the separator. The value of FLROOT should be the path to the COMSOL installation directory.

--export=FI_PROVIDER=sockets,FLROOT=<COMSOL installation directory>

The downside of this approach is that communication falls back to TCP, which might be slow if a faster fabric is available.

B. Install a Later Intel MPI

Download the latest Intel MPI from Intel and install it. You can install it in your home directory if you do not have admin rights on the cluster.

Launch COMSOL with the additional switch

-mpiroot <Intel MPI installation directory>/intel/oneapi/mpi/latest 

On Slurm, you can call for example

#SBATCH --nodes=2  
#SBATCH --ntasks-per-node=1 
...
comsol hydra mpitest -nn 2 -nnhost 1 -mpiroot <Intel MPI installation directory>/intel/oneapi/mpi/latest
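Because a mistyped path is an easy mistake here, it can help to verify the directory before passing it to -mpiroot. The sketch below assumes Intel MPI was installed under your home directory; INTEL_MPI_PREFIX and check_mpiroot are hypothetical names introduced for this example only.

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

# Hypothetical install prefix; replace it with your Intel MPI installation directory.
INTEL_MPI_PREFIX="${INTEL_MPI_PREFIX:-$HOME}"
MPIROOT="$INTEL_MPI_PREFIX/intel/oneapi/mpi/latest"

# Succeeds only if the directory given as the first argument exists.
check_mpiroot() {
    [ -d "$1" ]
}

if check_mpiroot "$MPIROOT"; then
    echo "Using Intel MPI at $MPIROOT"
    if command -v comsol >/dev/null 2>&1; then
        comsol hydra mpitest -nn 2 -nnhost 1 -mpiroot "$MPIROOT"
    fi
else
    echo "Intel MPI directory not found: $MPIROOT" >&2
fi
```

The script only reports a missing directory rather than aborting, so it can be adjusted to fail hard if that suits your workflow better.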

Remarks:

  • You can also point to other MPICH2-based MPI installations (but not to OpenMPI, for example).
  • In COMSOL 5.6, you can point to the new Intel MPI via -mpiroot as well.