Jim Freels
mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist
Posted: 27 Aug 2011, 11:59 UTC−4
If you search for my last name on the COMSOL web site, you will find several papers in the COMSOL conference proceedings where I have included parallel processing performance results from our RHEL cluster here. We purchased the cluster essentially for COMSOL (but use it for other purposes as well). I have included a couple of the links below. The first paper only shows results for shared-memory parallel processing, prior to COMSOL's ability to do distributed parallel processing. Some distributed parallel processing (DPP) results are shown in the second link. We are now exercising the parallel capability on our "real" problem and will soon be using the new iterative-solver parallel capability in v4.2.
www.comsol.com/shared/downloads/stories/oakridge_nuclear_safety_related_procedures/oakridge_nuclear_safety_related_procedures.pdf
www.comsol.com/papers/7970/download/freels_presentation.pdf
There is a keynote by Dr. Darrel Pepper at this year's conference, and I am looking forward to hearing about his experiences on (perhaps) a larger cluster. I hope to try out COMSOL soon on a larger cluster here at ORNL. So far, with 12 nodes, I have seen no rollover in the speedup. I suspect that COMSOL can scale to a fairly large number of compute nodes provided the communication link is also very robust (InfiniBand or better).
Posted: 15 Sep 2011, 17:11 UTC−4
Hi James,
I have looked over your slides and I have a question for you. In the second presentation, you show the memory usage for distributed parallel processing. Your plot makes sense to me, but what I observe in my case is not what I expected.
I am increasing the number of nodes in my cluster without seeing a decrease in the memory used per node (each node has 8 virtual cores). In other words, a problem that needs 500 GB of memory on 10 nodes will use 1 TB of memory on 20 nodes. Any ideas why I see this behaviour?
Thank you in advance for your help.
Dragos
Posted: 15 Sep 2011, 18:01 UTC−4
Dear All,
I have read the documentation again...
For memory-hungry models one should use -nn 10 -np 8, as opposed to -nn 80.
For small models, -nn 80 gives the best gain.
I went nuts over this during the last couple of days. I can report that the simulation time goes down from 2655 s to 755 s, and the memory goes down from ~50 GB/node to ~16 GB/node.
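To make the two launch styles concrete, here is a small Python sketch that assembles the command lines being compared. The helper function and file names are hypothetical, and the exact batch syntax can vary by COMSOL version, so check your version's cluster documentation; only the -nn/-np flags are taken from this thread.

```python
def comsol_batch_command(nn, np_per_node, infile, outfile):
    # Assemble a COMSOL cluster batch launch line (sketch).
    # -nn: number of distributed workers; -np: cores (threads) per worker.
    cmd = ["comsol", "batch", "-nn", str(nn)]
    if np_per_node is not None:
        cmd += ["-np", str(np_per_node)]
    cmd += ["-inputfile", infile, "-outputfile", outfile]
    return " ".join(cmd)

# Memory-hungry model: one worker per physical machine, 8 cores each
print(comsol_batch_command(10, 8, "model.mph", "out.mph"))
# Small model: one worker per core across the 80-core pool
print(comsol_batch_command(80, None, "model.mph", "out.mph"))
```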
James, I suppose your memory usage plot uses the first configuration, where you specify the number of processors (cores) per node. Am I right?
Thanks,
Dragos
Jim Freels
mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist
Posted: 15 Sep 2011, 18:14 UTC−4
Yes, that is correct. In nearly all cases that I can remember, I have specified both -nn and -np on the command line. This is reinforced now because our cluster has some nodes with more cores than others: our older nodes have 8 cores per node, but our new nodes have 16 cores per node. It seems to me that COMSOL requires the same number of cores per node in order to balance the load.
My issue now is that the load does not balance during the periods the solver spends in the finite-element assembly process. During the actual solve and matrix factorization it is fairly well balanced, but not during assembly. The memory utilization is fairly balanced, but not linear with the number of nodes, as the surface plots show.
I have not run on a large cluster yet. I expect to do that soon. I am interested in your performance on 80 nodes.
One of the keynote talks at the conference this year (Pepper) will discuss performance on large clusters.
Posted: 15 Sep 2011, 19:02 UTC−4
Hi James,
So, let's say you have two nodes, one with 8 cores and one with 16 cores. Would you use -nn 3 -np 8 to try to balance the load? I will try to boot instances with half the number of cores and see how they behave. In any case, I can see the cores are not working all the time as they did when I specified -nn 80 instead of -nn 10 -np 8, but at least I am no longer running out of memory.
Thanks,
Dragos
Jim Freels
mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist
Posted: 15 Sep 2011, 22:24 UTC−4
No, I would use -nn 2 -np 8.
I interpret these as:
-nn: number of compute nodes
-np: number of processors per compute node (i.e., total number of cores per compute node)
So, in this situation, you end up with 8 cores on the 16-core node that are not used by COMSOL and are free to be used by another application.
I think in your case, where you set -nn 80, COMSOL was filling up the first set of compute nodes in your cluster, then starting another set, and so forth. So if you have a 10-node cluster, you probably had 8 instances of COMSOL running on each of your compute nodes. I am amazed it ran at all; it must have been writing a lot to swap space on your disk drives.
How many compute nodes do you have in your cluster, how much memory on each compute node, and how many cores on each? I can then recommend settings to use. How big a problem are you running?
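The oversubscription described here can be sketched numerically. The round-robin placement below is an illustrative model of how -nn workers might land on physical machines, not COMSOL's or MPI's actual scheduler:

```python
def workers_per_machine(nn, n_machines):
    # Spread nn COMSOL workers round-robin over n_machines physical
    # machines and return the count landing on each machine.
    base, extra = divmod(nn, n_machines)
    return [base + (1 if i < extra else 0) for i in range(n_machines)]

# -nn 80 on a 10-machine cluster: 8 COMSOL instances per machine,
# each competing for that machine's memory
print(workers_per_machine(80, 10))
# -nn 10 on the same cluster: one instance per machine
print(workers_per_machine(10, 10))
```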
Posted: 16 Sep 2011, 18:09 UTC−4
Hi James,
If you want to use all the cores in your cluster, I encourage you to try -nn 3 -np 8; even -nn 6 -np 4 should work (this is for the example with two physical machines with 16 and 8 cores, respectively). MPD should be able to distribute the nodes and processors accordingly. I think it also depends on how you start mpd, but I am not positive here. In my case I start mpd on each individual machine and use --ncpus to specify the number of cores for each machine. This way I end up with a pool of 80 cores, and it is up to me to choose the right combination of nodes and processors. The only constraint is
nodes*processors <= 80
I do not look at the number of nodes, i.e., -nn, as the number of physical machines, but rather as the number of logical compute cluster nodes. This way I can choose -nn from 8 up to 80.
I have run an example with the following combinations:
-nn 10 -np 8 (one compute cluster node per machine)
-nn 20 -np 4 (two compute cluster nodes per machine)
-nn 40 -np 2 (four compute cluster nodes per machine)
-nn 80 (no -np) (8 compute cluster nodes per machine)
MPD is smart enough to distribute the compute nodes uniformly across the cluster. The documentation states that a big model should be run with the first configuration, whereas a small model benefits from the last one. Also, I am not at all worried about large-scale deployments, as MPI should work with a really huge pool of processors. I can tell you I have solved a problem using -nn 160 and had no issues.
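The combinations listed above all satisfy the nodes*processors <= 80 constraint. A quick sketch that enumerates them for an 80-core pool by halving the per-worker core count (illustration only; any divisor pair satisfying the constraint would be valid):

```python
def nn_np_combinations(total_cores, cores_per_machine):
    # Enumerate (-nn, -np) pairs with nn * np == total_cores, halving
    # np each step from one-worker-per-machine down to one-per-core.
    combos = []
    np_ = cores_per_machine
    while np_ >= 1:
        combos.append((total_cores // np_, np_))
        np_ //= 2
    return combos

# The four configurations from the post, for 10 machines x 8 cores
print(nn_np_combinations(80, 8))
```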
Thanks,
Dragos
Jim Freels
mechanical side of nuclear engineering, multiphysics analysis, COMSOL specialist
Posted: 17 Sep 2011, 11:41 UTC−4
I was not aware that you could do this sort of thing, but it makes sense. I wonder: if you break up a physical compute node into smaller compute nodes, doesn't that mean the data transfer for distributed parallel processing takes place across Ethernet (or InfiniBand in our case) instead of through shared memory? Is MPI smart enough to detect that it could transfer data within the bandwidth of the motherboard (faster) instead of over the cables (slower)?
Also, if you break down the physical processor into virtual processors (for lack of a better word) by using a larger number for the -nn switch than the number of compute nodes you physically have, does the efficiency within the processor go up or down? For example, on average for my currently running job, COMSOL is using 6 of 8 cores (looking at the 15-minute average over an hour, day, or week using the Ganglia tool) on some nodes, but only 2 of 8 cores on another node. COMSOL is not dividing the load very well during the finite-element assembly process. So, if you break it down into smaller virtual compute nodes, does this efficiency go up or down?
In other words, for a given job, will this make it run faster or slower?
Posted: 20 Sep 2011, 00:35 UTC−4
Hi James,
If you have a machine with more than one worker, then MPI uses the loopback interface to communicate between workers on the same machine. However, as you increase the number of workers, the demand for memory increases. So, let's say you have 10 physical machines, each having 8 cores. From my experience, if you want to reduce the amount of memory used, you will launch the job with the
-nn 10 -np 8
configuration, i.e., one worker per physical machine. My models are big, so I am monitoring the memory load. I can report that besides the unbalanced CPU load you have observed, the memory is not equally distributed among the 10 physical machines either. If your models are small you can do
-nn 80 -np 1
in which case the memory requirement is more than double that of the previous case. However, your cores will work ALL the time because you have assigned one worker per core.
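The memory growth with worker count can be illustrated with a toy model. The fixed per-worker overhead and all the numbers below are made up purely to show the trend described here; this is not COMSOL's actual memory behaviour:

```python
def aggregate_memory_gb(nn, problem_gb, overhead_gb):
    # Toy model: each worker holds its 1/nn share of the problem plus a
    # fixed per-worker overhead, so aggregate memory grows linearly in nn.
    return nn * overhead_gb + problem_gb

# Same 100 GB problem with a made-up 1.5 GB per-worker overhead
for nn in (10, 20, 40, 80):
    print(f"-nn {nn}: ~{aggregate_memory_gb(nn, 100.0, 1.5):.0f} GB total")
```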
You will have to experiment with a real cluster and see whether there is a speed gain. I can tell you I did not see a definitive speedup, but that might be because my cluster is virtual. In my case the average ping time between nodes is between 0.27 ms and 0.35 ms.
Please let me know if you observe any change in speedup for different -nn/-np configurations.
Thanks,
Dragos
Posted: 7 Oct 2011, 08:25 UTC−4
Coming back to the initial post by Vu Le, I would like to share some timing results obtained on our Red Hat Linux cluster, pointing out the importance of the choice of test case.
For some time, I got the same kind of disappointing results as Vu Le. The test case I used was a linear diffusion equation with a very fine mesh and 8 coupled unknowns, altogether 6.7 MDoFs. The solver used was MUMPS, adapted to shared-memory computation. The test case looked fine, but the results showed no shared-memory improvement at all:
nodes  cores  time (s)
  1      1      479
  1      8      269
  2      8      284
  3      8      284
As this agreed neither (i) with my expectations nor (ii) with the results presented by James Freels at the COMSOL Conference, I contacted him. On his advice I built another test case designed to spend more time in the matrix factorization part, where parallelism is efficiently implemented. The case is smaller than the previous one (1.06 MDoFs) and the 3D geometry is basic, but the coupled system of diffusion equations is now nonlinear. The shared-memory speed-up is still not what one obtains in Monte Carlo computations, but this is finite elements:
nodes  cores  time (s)  memory (GB)
  1      1      9059        32
  1      4      2530        32
  1      8      1493        32
  2      8       923        18
  3      8       725        16
  4      8       573        11
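From the timings in the second table, speedup and parallel efficiency relative to the serial run can be computed directly. A small script over the numbers above (the timings are the ones posted; the speedup/efficiency columns are derived):

```python
# Timings from the second (nonlinear, 1.06 MDoF) test case: (nodes, cores) -> seconds
timings = {(1, 1): 9059, (1, 4): 2530, (1, 8): 1493,
           (2, 8): 923, (3, 8): 725, (4, 8): 573}
serial = timings[(1, 1)]  # 1 node x 1 core reference

for (nodes, cores), t in sorted(timings.items()):
    total = nodes * cores
    speedup = serial / t
    print(f"{nodes} node(s) x {cores} core(s): speedup {speedup:5.2f}, "
          f"efficiency {speedup / total:.2f}")
```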
Best regards,
Alain Glière