The Fortran code used for the modeling was originally written by
David Bryan, a participant in SPUR '94, under the guidance of Dr. Lantz.
His project is documented in a separate report.
By running simulations of different models, researchers can attempt to
match their models to reality. Simulations also show how changes in
parameters affect the behavior of the model. However, the more
complicated the model becomes, the longer a program that simulates
the model will take to run. Also, more complex programs tend to require
more computer memory. These are both roadblocks to testing complex models.
Parallel computing offers an interesting solution to both problems posed
above. By distributing the work load to many processors,
the amount of time spent by each processor computing decreases.
If the computer is a distributed memory machine, like the IBM SP2 at the
Cornell Theory Center, the amount of data stored in each node can be
reduced.
Because of the possibility of running simulations on data sets too large
or time consuming for a single computer, we attempted to write
a parallel program that would serve as a modifiable basis for running
different models of the solar convection zone on a parallel machine.
To test our code, we used a basic model that did not harness the full
power of running a job in parallel.
Below is the section of the convection zone we are considering, with the
rest of the sun removed. A further approximation is that the fluid motion is
two-dimensional; that is, no fluid moves
parallel to the axis of the cylinder. This is a good approximation if
the cylinder is rotating quickly enough about its axis (Busse and Or,
JFM, 1986). Thus, the entire cylinder can be represented by one cross
section.
Another assumption is that all fields are periodic going around the disk.
Then, the full disk can be represented by a section containing
a single period.
This section can then be approximated by a rectangle.
It is on this rectangle
that all the calculations are done in our simulation.
The Navier-Stokes equation tells how a piece of fluid moves when there is a
force acting on it; this is the basic equation of fluid flow.
The heat equation describes how heat flows into or out of a moving piece of
fluid. Notice that this equation also involves the velocity.
The incompressible fluid equation (the Boussinesq approximation) says that the
amount of fluid coming into a point is equal to the amount of fluid leaving
that point.
The equation of state relates the density of a piece of fluid to the
temperature of that piece of fluid. Notice that since the Navier-Stokes
equation depends on the density, by this equation it also depends on the
temperature.
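A minimal sketch of these equations in a standard Boussinesq form (the
nondimensionalization and notation used in the original report may differ):

\[ \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u} = -\frac{1}{\rho_0}\nabla p + \nu\,\nabla^2\mathbf{u} + \frac{\rho}{\rho_0}\,\mathbf{g} \qquad \text{(Navier-Stokes)} \]
\[ \frac{\partial T}{\partial t} + (\mathbf{u}\cdot\nabla)T = \kappa\,\nabla^2 T \qquad \text{(heat equation)} \]
\[ \nabla\cdot\mathbf{u} = 0 \qquad \text{(incompressible fluid)} \]
\[ \rho = \rho_0\,[\,1 - \alpha\,(T - T_0)\,] \qquad \text{(equation of state)} \]

Here \(\mathbf{u}\) is the velocity, \(T\) the temperature, \(p\) the pressure,
\(\rho\) the density, \(\nu\) the viscosity, \(\kappa\) the thermal
diffusivity, and \(\alpha\) the thermal expansion coefficient.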
Since our equations are differential equations, we need to specify what
the edge conditions are in order to solve them.
For the left and right edges of our rectangle, we require that the
temperature and velocity fields match on both sides.
This is because the rectangle is a periodic section of the disk.
For the top and bottom edges, we require that the velocity of the fluid
is parallel to the boundary. This keeps the fluid from flowing out of the
top or bottom of the rectangle. For the temperature, we keep
it fixed at the top and bottom corresponding to given outside and
inside temperatures.
The temporal components of our problem are also discrete.
We have the equations set up to compute what the
fields are at given time step intervals. As with the spatial
approximation, variations in the fields that occur in a shorter time
period than the time step are not observed. Therefore, the time step size
needs to be selected carefully, as with the control area.
Because our fields are discrete, taking the derivatives poses a problem.
Recall from calculus that the derivative is defined by taking a limit:
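In a generic notation, with \(f\) the field and \(h\) the spacing,
\[ f'(x) = \lim_{h \to 0}\,\frac{f(x+h) - f(x)}{h}. \]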
However, on a discrete grid there are no points arbitrarily close to each
grid point, so the actual derivative cannot be calculated. Approximations
to the actual derivatives can be computed using Taylor's Theorem.
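The symmetric (centered) approximation used for the first derivative is the
standard one,
\[ \frac{\partial f}{\partial x}\bigg|_{x_i} \;\approx\; \frac{f(x_i + h) - f(x_i - h)}{2h}, \]
where \(h\) is the grid spacing.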
By applying Taylor's Theorem, this formula can be shown to be
second-order accurate in the grid spacing.
The second derivative is found by taking the derivative of the derivative.
While asymmetric finite difference approximations are used for both
derivatives, the result is symmetric.
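Concretely, applying a forward difference to a backward difference (or vice
versa) gives
\[ \frac{\partial^2 f}{\partial x^2}\bigg|_{x_i} \;\approx\; \frac{1}{h}\left[\frac{f(x_i+h) - f(x_i)}{h} - \frac{f(x_i) - f(x_i-h)}{h}\right] = \frac{f(x_i+h) - 2 f(x_i) + f(x_i-h)}{h^2}, \]
which is symmetric about \(x_i\).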
Because the Navier-Stokes and heat equations give the explicit
time derivatives of the fields, provided that the values of the fields
at the current time are known, such an
integration method can be used to advance the fields through time.
The asymmetric (one-sided) derivative approximation used above, inside the
second derivative, can be rearranged to give an approximation for
integrating the time derivative forward:
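In generic notation, with \(\Delta t\) the time step, this first-order update
reads
\[ f(t + \Delta t) \;\approx\; f(t) + \Delta t\,\frac{\partial f}{\partial t}\bigg|_{t}. \]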
The above is the simplest Runge-Kutta method, accurate only to first order by
Taylor's Theorem. More accuracy is
available from a higher-order method, at the cost of more
computation. By experimentation, we found that a
third-order Runge-Kutta method was stable and accurate enough for our
problem. Below are the general equations for a third-order Runge-Kutta:
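In one standard notation (the symbols used in the original report may differ),
a general three-stage Runge-Kutta step for \(du/dt = F(u, t)\) is
\[ k_1 = F(u_n,\; t_n), \]
\[ k_2 = F(u_n + \Delta t\, a_{21} k_1,\; t_n + c_2\,\Delta t), \]
\[ k_3 = F\big(u_n + \Delta t\,(a_{31} k_1 + a_{32} k_2),\; t_n + c_3\,\Delta t\big), \]
\[ u_{n+1} = u_n + \Delta t\,(b_1 k_1 + b_2 k_2 + b_3 k_3). \]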
In addition, the following relations between coefficients must be
satisfied for the method to be third-order accurate:
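In the notation above, the standard third-order conditions are
\[ c_2 = a_{21}, \qquad c_3 = a_{31} + a_{32}, \]
\[ b_1 + b_2 + b_3 = 1, \qquad b_2 c_2 + b_3 c_3 = \tfrac{1}{2}, \]
\[ b_2 c_2^2 + b_3 c_3^2 = \tfrac{1}{3}, \qquad b_3\, a_{32}\, c_2 = \tfrac{1}{6}. \]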
Notice that there are 8 variables and only 6 equations, which leaves a
two-parameter family of solutions to these equations.
Using a method that uses fewer working arrays could help to optimize our
program for memory usage.
Dr. Lantz proposed a method that uses only the field array plus just
one working array.
The structure of an algorithm that allows this is:
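The specific scheme proposed in the report uses coefficients m, a, b, and c;
as an illustration of the general idea only, a widely used two-register form
(due to Williamson) performs, at each stage \(i = 1, 2, 3\),
\[ q \;\leftarrow\; A_i\, q + \Delta t\, F(u), \qquad u \;\leftarrow\; u + B_i\, q, \]
where \(u\) is the field array, \(q\) is the single working array, and
\(A_1 = 0\).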
The extra equations that have to be satisfied for this to work,
besides the 6 requirements mentioned previously, are the
following:
Using Maple, we solved the set of equations for the m's, a's, b's and
c's after applying some extra conditions.
Our extra conditions were to make as many coefficients as possible
in the above Runge-Kutta algorithm equal to zero or one.
When zero, no computation is required with the corresponding term.
When one, no floating point multiplication is required.
Each case could lead to increased program performance.
It turned out that some combinations gave valid solutions,
while others did not.
The results from solving the equations for the coefficients are given
separately.
But because of the various partial derivatives that must be computed, there
is coupling between points on the grid. In most of the equations,
derivatives are computed by finite difference formulas that depend only on
neighboring points on the grid. So it seems that neighbors should be kept
together as much as possible when assigning grid points to processors.
However, there is also an elliptic equation (a Poisson equation) to be
solved. When discretized, the Poisson equation becomes a large matrix
equation, coupling together all the points along a row or column of the
grid. This complicates the choices for how to divide up the grid among
processors. We decided to cut our domain into horizontal strips, so that
all data coupled horizontally would be present together on the same
processor.
We set out to create a data-parallel program, in which processors are doing
computations on different parts of the domain. So the first issue to
address is: how should distinct strips of data be sent out to the different
processors? It is very convenient to do this in MPI, because such strip
initializations can be done with a single call to MPI_Scatter at the start
of the job (see figure). The inverse of
this operation, MPI_Gather, is similarly useful for collecting data back to
a single node. In our program, the gather operation occurs every time an
output file is to be written; such output events occur at the end of a run
and at regular intervals during the run.
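A minimal sketch of this pattern (the array names, dimensions, and strip
layout are assumptions, not the original code):

    program scatter_gather_sketch
      implicit none
      include 'mpif.h'
      integer, parameter :: nx = 128, nz = 64
      integer :: ierr, rank, nprocs, nzl
      double precision, allocatable :: t_global(:,:), t_local(:,:)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

      nzl = nz / nprocs                   ! rows in the strip owned by each task
      allocate(t_local(nx, nzl))
      if (rank == 0) then
         allocate(t_global(nx, nz))       ! root holds the full field
         t_global = 0.0d0
      else
         allocate(t_global(1, 1))         ! dummy; contents ignored off-root
      end if

      ! hand one contiguous strip of nzl rows to every task
      call MPI_SCATTER(t_global, nx*nzl, MPI_DOUBLE_PRECISION, &
                       t_local,  nx*nzl, MPI_DOUBLE_PRECISION, &
                       0, MPI_COMM_WORLD, ierr)

      ! ... time stepping on the local strip goes here ...

      ! collect the strips back onto the root task whenever output is written
      call MPI_GATHER(t_local,  nx*nzl, MPI_DOUBLE_PRECISION, &
                      t_global, nx*nzl, MPI_DOUBLE_PRECISION, &
                      0, MPI_COMM_WORLD, ierr)

      call MPI_FINALIZE(ierr)
    end program scatter_gather_sketch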
Communication between processors must also occur at very frequent intervals
during a run, in order to advance the code in time. There are two types of
operations requiring communication. First, for computing derivatives in the
time-evolution equations, data from neighboring points is needed; and it
sometimes happens that these neighboring points reside on another processor,
necessitating communication. Second, a communication pattern involving all
the processors together is needed for solving the previously-mentioned
global, elliptic equation. Parallelizing the elliptic solver is the tougher
of the problems, so we will discuss it in detail in the next section. For
the moment, we'll look only at the nearest-neighbor communications.
MPI provides a natural way for neighboring processors to communicate with
each other, called MPI_Sendrecv. The sendrecv operation sends the data from
a buffer in the local processor across to the designated processor, while at
the same time receiving data from that processor (or another one) into a
different buffer. With MPI_Sendrecv, data can be swapped between processors
and received into guard cells, as we see in the accompanying figure.
The operation needs to be done for each of the computational arrays. In
practice, it is more efficient to pack a special buffer with all the
necessary data from different arrays (temperature, vorticity,
streamfunction) and do just one sendrecv operation (rather than two) in
order to cut down on latency overhead--where latency is defined to be
the cost to initiate a new message.
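A sketch of the guard-row swap for a single array (the production code packs
several arrays into one buffer, as noted above; the array name, index
convention, and routine are assumptions, not the original code):

    ! Swap guard rows with the strips above and below.  Interior rows of
    ! temp are 1..nzl; rows 0 and nzl+1 are guard cells.
    subroutine swap_guard_rows(temp, nx, nzl, rank, nprocs)
      implicit none
      include 'mpif.h'
      integer, intent(in) :: nx, nzl, rank, nprocs
      double precision, intent(inout) :: temp(nx, 0:nzl+1)
      integer :: up, down, status(MPI_STATUS_SIZE), ierr

      up   = rank + 1
      down = rank - 1
      if (up   > nprocs - 1) up   = MPI_PROC_NULL   ! physical boundary above
      if (down < 0)          down = MPI_PROC_NULL   ! physical boundary below

      ! send our top interior row upward; receive the lower neighbor's
      ! top interior row into our bottom guard row
      call MPI_SENDRECV(temp(1, nzl), nx, MPI_DOUBLE_PRECISION, up,   1, &
                        temp(1, 0),   nx, MPI_DOUBLE_PRECISION, down, 1, &
                        MPI_COMM_WORLD, status, ierr)

      ! send our bottom interior row downward; receive the upper neighbor's
      ! bottom interior row into our top guard row
      call MPI_SENDRECV(temp(1, 1),     nx, MPI_DOUBLE_PRECISION, down, 2, &
                        temp(1, nzl+1), nx, MPI_DOUBLE_PRECISION, up,   2, &
                        MPI_COMM_WORLD, status, ierr)
    end subroutine swap_guard_rows

When a neighbor is MPI_PROC_NULL, the corresponding send and receive become
no-ops, so the physical-boundary case described below falls out automatically.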
A special case arises when the strip has a physical boundary along one edge,
because no communication is required. Instead, guard cell values are
computed in the following way: imagine the physical boundary of the system
to lie just beyond the last row of grid points. In particular, fix the
location of the physical boundary to be exactly between the final row of
data points and the row of guard cells next to it. Then, each guard cell
value is defined such that the value halfway in between interpolates
to the proper, physical boundary value. (The boundary condition on the
temperature perturbation, e.g., is such that its value must be zero
on the physical boundary.) The interpolating polynomial used for this
procedure is second-order, so values from the last two rows of data, as well
as the value at the physical boundary, are involved.
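For example, for the temperature perturbation, whose boundary value \(f_b\) is
zero, a quadratic fit consistent with the description above (uniform spacing,
boundary midway between the last data row and the guard row) gives the
guard-cell value
\[ g \;=\; \frac{8 f_b - 6 f_1 + f_2}{3}, \]
where \(f_1\) is the value in the last row of data and \(f_2\) the value in
the row before it. This particular formula is an illustrative reconstruction,
not necessarily the exact expression used in the code.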
In our model, the time evolution of the 2D velocity field is not computed in
the direct way from the Navier-Stokes equation. Rather, we manipulate our
fluid equations mathematically so that we can deal instead with a quantity
called the vorticity (omega), defined as the vector curl of
the velocity. In 2D, the vorticity becomes a scalar, and it is therefore a
more convenient quantity for computations than the two-component velocity
vector.
An evolution equation for the vorticity can be derived by taking the curl of
the Navier-Stokes equation. But the velocity is still needed in all the
equations, so it somehow must be determined from the vorticity. This is
done by exploiting the fact that the velocity is divergenceless in the
Boussinesq approximation. This implies that the velocity can be written as the
curl of yet another field, called the streamfunction (phi).
One can show that the vorticity and the streamfunction are related by a
Poisson equation, written out below. It is an elliptic partial differential
equation to be solved for phi, given omega. This phi
in turn determines the velocity: it is computed simply by taking derivatives
(or finite differences) of the streamfunction, once it becomes known.
Altogether, the computational procedure is as follows: first, the vorticity
is explicitly advanced in time; then, the Poisson equation for phi is
solved; then, velocities for the next time step are computed.
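For reference, in the two-dimensional (x, z) geometry and with one common sign
convention (the report's convention may differ), these quantities are related
by
\[ u_x = \frac{\partial \phi}{\partial z}, \qquad u_z = -\frac{\partial \phi}{\partial x}, \qquad \omega = \frac{\partial u_z}{\partial x} - \frac{\partial u_x}{\partial z} = -\nabla^2 \phi, \]
so the Poisson equation to be solved is \(\nabla^2 \phi = -\omega\).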
The first step is to Fourier-transform the above equation in the x
dimension. This is a natural enough thing to do, given the periodic
boundary conditions that apply in this direction. The transformed equation
is shown below, with derivatives in x simply replaced by multiplication by ik.
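With the sign convention above, and writing \(\hat{\phi}_k(z)\) and
\(\hat{\omega}_k(z)\) for the Fourier coefficients at horizontal wavenumber
\(k\),
\[ \frac{d^2 \hat{\phi}_k}{dz^2} - k^2\,\hat{\phi}_k = -\hat{\omega}_k. \]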
This equation can be discretized very readily. It is already discrete in
k by virtue of the periodic boundary conditions, and the finite mesh
spacing has determined the maximum k. In the z-direction, the
remaining second derivative can be discretized using its finite-difference
approximation. The overall result, together with an equivalent rearranged
form, is given below.
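A sketch, with \(\Delta z\) the vertical grid spacing and \(i\) the z index:
\[ \frac{\hat{\phi}_{k,i+1} - 2\hat{\phi}_{k,i} + \hat{\phi}_{k,i-1}}{\Delta z^2} - k^2\,\hat{\phi}_{k,i} = -\hat{\omega}_{k,i}, \]
or, multiplying through by \(\Delta z^2\) and collecting terms,
\[ \hat{\phi}_{k,i+1} - \left(2 + k^2 \Delta z^2\right)\hat{\phi}_{k,i} + \hat{\phi}_{k,i-1} = -\Delta z^2\,\hat{\omega}_{k,i}. \]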
Notice that the equations are completely independent in k but are
coupled in the index i. This results in a set of independent
matrix problems to be solved, one for each k:
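In that notation, writing \(d_k = -(2 + k^2 \Delta z^2)\) for the diagonal
entries, each system has the form
\[ \begin{pmatrix} d_k & 1 & & \\ 1 & d_k & 1 & \\ & \ddots & \ddots & \ddots \\ & & 1 & d_k \end{pmatrix} \begin{pmatrix} \hat{\phi}_{k,1} \\ \hat{\phi}_{k,2} \\ \vdots \\ \hat{\phi}_{k,N} \end{pmatrix} = -\Delta z^2 \begin{pmatrix} \hat{\omega}_{k,1} \\ \hat{\omega}_{k,2} \\ \vdots \\ \hat{\omega}_{k,N} \end{pmatrix}. \]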
The corners of the matrix have been neglected here; in general they contain
special values that depend on the boundary conditions. The above matrix is
a special, sparse type of matrix called tridiagonal, for which very
fast algorithms exist (order N operations, to solve an N by N matrix).
Unfortunately, these fast algorithms are inherently sequential!
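To make the sequential nature concrete, here is a sketch of the standard
Thomas algorithm for a tridiagonal system (the generic algorithm, not
necessarily the exact solver used in the original code); note that every step
of both sweeps depends on the result of the step before it.

    ! Solve a tridiagonal system: a = sub-diagonal, b = diagonal,
    ! c = super-diagonal, r = right-hand side, x = solution.
    subroutine tridiag(a, b, c, r, x, n)
      implicit none
      integer, intent(in) :: n
      double precision, intent(in)  :: a(n), b(n), c(n), r(n)
      double precision, intent(out) :: x(n)
      double precision :: cp(n), rp(n), denom
      integer :: i

      ! forward elimination sweep
      cp(1) = c(1) / b(1)
      rp(1) = r(1) / b(1)
      do i = 2, n
         denom = b(i) - a(i)*cp(i-1)
         cp(i) = c(i) / denom
         rp(i) = (r(i) - a(i)*rp(i-1)) / denom
      end do

      ! back substitution sweep
      x(n) = rp(n)
      do i = n-1, 1, -1
         x(i) = rp(i) - cp(i)*x(i+1)
      end do
    end subroutine tridiag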
It seemed at first that the specified data movements could be done very
simply by using the MPI_Alltoall subroutine. In MPI_Alltoall, each
processor sends a message to, and receives a message from, all the other
processors. In our case, this message should simply consist of a submatrix
of the original matrix. But it turns out that there are additional, local
transposes that need to be done on the submatrices in order to get the data
into the correct places, both before and after the MPI_Alltoall operation.
This is because MPI_Alltoall does not transpose any of the data within a
submatrix; it merely transposes submatrices intact. Furthermore,
MPI_Alltoall only works when each submatrix has its elements stored in
contiguous memory locations.
So, in place of the simple call to MPI_Alltoall (in steps 2 and 4 above), a
three-step procedure has to be performed: 1) locally rearrange (transpose)
each outgoing submatrix into a contiguous block of a send buffer; 2) exchange
the submatrices among all the processors with MPI_Alltoall; 3) locally
rearrange each received submatrix into its proper place in the transposed
array.
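A sketch of that three-step procedure in Fortran with MPI follows; the array
names, the strip layout a(nk, nzl), and the assumption that nk and nz divide
evenly among the processors are illustrative choices, not taken from the
original code.

    ! Transpose a matrix distributed as horizontal strips: a(nk, nzl)
    ! holds every wavenumber for this task's nzl rows; on return,
    ! at(nz, nkl) holds every row for this task's nkl wavenumbers.
    subroutine global_transpose(a, at, nk, nzl, nz, nkl, nprocs)
      implicit none
      include 'mpif.h'
      integer, intent(in) :: nk, nzl, nz, nkl, nprocs
      double precision, intent(in)  :: a(nk, nzl)
      double precision, intent(out) :: at(nz, nkl)
      double precision :: sendbuf(nkl*nzl, nprocs), recvbuf(nkl*nzl, nprocs)
      integer :: p, i, j, ierr

      ! step 1: local rearrangement, so the submatrix bound for each task
      ! occupies contiguous memory in the send buffer
      do p = 1, nprocs
         do j = 1, nzl
            do i = 1, nkl
               sendbuf(i + (j-1)*nkl, p) = a((p-1)*nkl + i, j)
            end do
         end do
      end do

      ! step 2: every task exchanges one submatrix with every other task
      call MPI_ALLTOALL(sendbuf, nkl*nzl, MPI_DOUBLE_PRECISION, &
                        recvbuf, nkl*nzl, MPI_DOUBLE_PRECISION, &
                        MPI_COMM_WORLD, ierr)

      ! step 3: local rearrangement of each received submatrix into its
      ! proper place in the transposed array
      do p = 1, nprocs
         do j = 1, nzl
            do i = 1, nkl
               at((p-1)*nzl + j, i) = recvbuf(i + (j-1)*nkl, p)
            end do
         end do
      end do
    end subroutine global_transpose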
The global matrix transpose scheme can best be explained by a picture. (Dr. Lantz has also
created a color version.) Note
that the transpose procedure works equally well if the order of the steps is
reversed.
Table of Contents
Introduction
Our main objective was to parallelize and optimize a program that
simulated a particular model of solar convection. This required us
to learn about methods of parallelization and optimization,
and to understand how and why the original program worked. The first
two sections of this report talk about the model and how it was
implemented in the original code. The third section deals with
how the code was optimized, while the last three sections discuss how
the code was parallelized.
Motivation
Direct observations of the sun indicate much is happening in the
convection zone. Most apparent are the granules, the tops of
convection cells. One can also observe supergranules, which consist
of a few hundred granules each. But it is unknown why the different
types of cells come in distinct sizes.
Method
Pictured below is a cutaway of the sun showing the convection zone.
This zone occupies the top 30% of the sun, where heat is
transported from the hot inside of the sun to the cooler outside
mainly by convection, instead of by radiative diffusion which is
the case deeper inside the sun. The only part of the zone we are
considering is the section represented by the shaded area.
We assume this section is a good approximation to the
actual shape of the convection zone in that area. We do not consider
what happens in the zone at the poles.
Equations
The fields that we are concerned with are the temperature and velocity
fields. For the Boussinesq convection model we are using,
those are the only
fields necessary. The equations below describe the restrictions on the
fields and how the fields evolve and interact.
The Navier-Stokes Equation
The Heat Equation
The Incompressible Fluid Equation (Boussinesq Approximation)
The Equation Of State
Edge Conditions
Derivative Approximations
Why Approximations Are Necessary
We approximate the fields by using arrays. The value in each box of the
array is the value of the field in a corresponding area of
the star. Let this area be called the control area.
Any variation in the actual fields of a smaller scale than
the control area is not seen in the array. Because of this, the
control area needs to be selected so it is smaller than the size of the
smallest important field variation.
Spatial Derivatives
We approximate spatial derivatives by using a symmetric finite difference
approximation. For the first derivative, we use the symmetric formula given
earlier, which uses the field values at the two neighboring grid points.
Temporal Derivative
In contrast to the spatial derivatives (where the field values on either
side of the present point are used to calculate the derivative), we
have the field value and time derivative at the current point and want
to calculate what the field will be at the next time step. This amounts
to performing an integration of the time derivative.
Optimization
Since we are updating each part of the entire field at the same time,
each one of the temporary variables used in the Runge-Kutta method above must be
taken to be a temporary array, of the same size as our field array.
This would require quite a bit of extra memory in our program.
Parallelization
Domain Decomposition
A major goal of this project was to take the
original program, now optimized to run on a single processor, and to
parallelize it. The main thing to consider when parallelizing a program is
how to divide the tasks over the processors. The most straightforward way
to do it in this case is by domain decomposition, i.e. by assigning
different chunks of the computational arrays to different processors. Each
processor is then responsible for computing its own part of the simulated
space. To be efficient, these computations should be kept as independent as
possible.
Interprocessor Communication Using MPI
A parallel program requires extra software to allow communication between
the independently running processors. For the parallel implementation of
the code, we selected MPI (the Message Passing Interface) to handle the
interprocessor communication. The MPI standard includes many types of
communication patterns, several of which are particularly well-suited to our
application.
Communication of the Boundary Values
On each processor, a horizontal strip of data is surrounded by a border of
extra grid points called "guard cells". The purpose of the guard cells is
to handle boundary conditions (on exterior edges) or to serve as
communication buffers (along interior partitions). The boundary conditions
are periodic in the horizontal direction, so guard cells can be assigned
locally, by wrapping around values from the opposite, matching side of the
strip. On the vertical edges, guard cells are assigned in one of two ways.
Where the edge corresponds to a physical boundary of the system, the
physical boundary conditions are applied. But along an interior partition,
updated information from a different processor must be brought in at regular
intervals, in order to compute the derivatives in the usual way.
Elliptic Equation
Computational Procedure for Solving the Poisson Equation
Clearly, the Poisson equation for the streamfunction must be solved every
time the vorticity is updated. In effect, one must invert the Laplacian
operator in the above equation. There are numerous ways to do this in an
approximate fashion. In this section, we outline just one possibility,
which is the one we actually implemented in the code.
Strategy for Parallelization
It might seem almost pointless to try to parallelize the above scheme.
Probably it would be, were it not for two additional facts: the Fourier
transforms of different rows of the grid are independent of one another, and
the tridiagonal systems for different wavenumbers k are likewise independent
of one another.
This mutual independence admits the possibility of a parallel solution.
Dr. Lantz suggested the following procedure (where the leading "//" symbol
marks operations that can be done in parallel, independently on each
processor):
1. // Fourier transform, in x, the rows of the grid held locally on each processor.
2. Transpose the distributed data so that each processor holds complete z-columns for a subset of the wavenumbers k.
3. // Solve the tridiagonal system in z for each locally held wavenumber.
4. Transpose the distributed data back to the original strip layout.
5. // Inverse Fourier transform the local rows to recover the streamfunction.
Results
After parallelizing the code, it was time to try out some simulation runs to
see if it was working properly. In particular, we wanted to create
visualizations of the simulation results. The tool we used to help us
visualize our data was IBM's Data Explorer. DX allowed us to render a
sequence of images, representing the fluid variables at fixed moments in
time. These frames could then be concatenated into an animation. Here is an MPEG movie showing a typical
thermal Rossby wave, starting off from a stationary state (plus some
"noise") and then settling into steady traveling motion. Contours of the
streamfunction, shown as black lines, represent streamlines of the flow,
while the color shading corresponds to the fluctuating part of the perturbed
temperature (i.e., what's left of the temperature perturbation after
a purely conductive profile has been subtracted off).
Overall, we succeeded in meeting the main goals of this project, which were to
optimize the original convection code and to parallelize it so that larger
simulations can be run across multiple processors.