Cornell University
Computing and Information Science 4205:
Effective Use of High Performance Computing
Steve Lantz, Instructor for Spring 2009
Course Location and Time:
Upson 205, MWF 10:10–11:00
|
Duration:
8 weeks, Jan. 19–Mar. 13
Credits:
2 Credit Hours, student option
Successful/Unsuccessful or Letter Grade
Prerequisites:
Proficiency in C, C++,
Fortran, or Fortran 90/95
Recommended Text:
Parallel Programming in C with MPI and OpenMP, by
Michael J. Quinn, on reserve at the Engineering Library
(preview available on Google Book Search)
Office Hours:
533 Rhodes Hall, by appointment
Send email to slantz@cac.cornell.edu
Grading and Policies: see the separate page on grading and policies
Synopsis:
A hands-on introduction to high-performance computing (HPC) for graduate students or advanced
undergraduate students who will use HPC as a tool in their work. Various HPC architectural
platforms are described with a focus on computational clusters. Students learn how to identify and
exploit parallelism in algorithms and legacy applications; how to measure parallel speedup and
efficiency; and how to diagnose bottlenecks affecting performance. Parallel programming with MPI,
OpenMP, and task-farming techniques (for Web services and grid computing) is covered in depth.
Examples and assignments are taken from typical application areas such as matrix and Monte Carlo
computations. The goal of the class is for students to gain practical HPC experience for use in
their specific fields.
Course Outline and Lecture Notes:
(*** UNDER CONSTRUCTION ***)
The outline is tentative and may be modified—students will be given a
voice in what is ultimately presented.
(Note: Lectures may contain concepts and examples from the textbook and other sources listed below.)
Week 1:
Introduction to High Performance Computing
Lecture Notes (ppt, pdf)
Week 2:
Building and Running Parallel Applications, Pt. 1;
Parallel Application Design
Lecture Notes (ppt, pdf)
Week 3:
Building and Running Parallel Applications, Pt. 2;
Distributed Memory Programming Using Basic MPI
Lecture Notes (ppt, pdf)
Week 4:
Distributed Memory Programming Using Advanced MPI
Lecture Notes (ppt, pdf)
Weeks 5 and 6:
In-Class “Guerrilla” Development of MPI Examples
Shared Memory Programming with Basic and Advanced OpenMP
Lecture Notes (ppt, pdf)
Weeks 7 and 8:
Parallel Performance
Lecture Notes (ppt, pdf)
Sample Codes:
Week 2:
helloworld.c,
heat2d_ser.c,
heatmovie.gif
Week 3:
api_examples.c,
roundtrip.c
Remove .txt from the filename to use any of the following shell scripts:
batch_test.sh.txt,
one_node.sh.txt,
two_nodes.sh.txt
Week 4:
pi.c, gather.c,
gather2.c, alltoall.c,
tp1.c, bagboy.c,
pp1.c
Week 5:
montepi.c,
fullmonte.sh.txt,
mw_skel.c
Week 6:
omp_pi.c,
omp_firstpriv.c,
omp_lastpriv.c,
omp_single.c
Week 7:
Assignment #3 solutions: bagboys.c, bagboys_better.c
omp_montepi.c
Assignment #4 solutions: naive.c, omp_naive.c, sieve.c, omp_sieve.c
Week 8:
omp_split.c,
omp_quinn1.c,
omp_quinn2.c,
omp_quinn3.c
Quizzes, Assignments, and Project Milestones:
QUIZ #1 ANNOUNCED for Wed. 1/28 based on Week 1 Lecture Notes!
ASSIGNMENT #1 DUE 1/30
Download the gzipped tar file from Tutorial 7 at
http://www.ee.surrey.ac.uk/teaching/unix/index.html
and build the code within it on any Linux computer by following the steps delineated in 7.2–7.6. Then:
- Run the code and email me all the output you get.
- For input, use the numerical part of your Cornell NetID as the number of feet.
- Example: if your NetID is abc6, your input would be “6 feet”.
- Ask for output in meters.
- In the process, you will learn how to run configure and make!
PROJECT MILESTONE #1 DUE 2/6
Propose a problem in parallel computing that you would like to solve as an outcome of this course.
It should involve the following elements:
- Designing a parallel program (due at the end of week 5)
- Writing a proof-of-principle code (due at the end of week 7)
- Verifying that your code works (due at the end of week 8)
- It should not be so simple that you can look it up in a book
- It should not be so hard that it’s equivalent to a Ph.D. thesis project
- You will be able to seek help from me and your classmates!
- Take this as an opportunity to work on something you care about
ASSIGNMENT #2 DUE 2/11
Build the roundtrip.c example using mpicc (a sketch of the round-trip idea follows this list):
- Try to understand how this code works
- Compile, then run the executable in batch using 1 and 8 nodes of v4
- For these node counts, use -ppn to run with 8 and 64 total processes
- Change the number of megabytes sent in the code, rebuild, and rerun 4 processes on 4 nodes
- Email me the output from the above
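For reference, here is a minimal sketch of the ping-pong pattern that a round-trip timing is built on. It is an illustration only; the actual roundtrip.c may be organized differently, and names like MSG_MEGABYTES and NTRIPS are invented here rather than taken from the course code.

/* Sketch of a round-trip (ping-pong) timing between paired ranks.
   MSG_MEGABYTES and NTRIPS are illustrative names only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_MEGABYTES 1      /* the value the assignment asks you to vary */
#define NTRIPS 100

int main(int argc, char *argv[])
{
    int rank, size, i, n = MSG_MEGABYTES * 1024 * 1024;
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    buf = malloc(n);

    if (rank % 2 == 0 && rank + 1 < size) {      /* even ranks send first */
        t0 = MPI_Wtime();
        for (i = 0; i < NTRIPS; i++) {
            MPI_Send(buf, n, MPI_CHAR, rank + 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, n, MPI_CHAR, rank + 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
        t1 = MPI_Wtime();
        printf("rank %d: %d MB round trip averaged %g s\n",
               rank, MSG_MEGABYTES, (t1 - t0) / NTRIPS);
    } else if (rank % 2 == 1) {                  /* odd ranks echo back */
        for (i = 0; i < NTRIPS; i++) {
            MPI_Recv(buf, n, MPI_CHAR, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, n, MPI_CHAR, rank - 1, 0, MPI_COMM_WORLD);
        }
    }
    free(buf);
    MPI_Finalize();
    return 0;
}

Build it the same way you build the real code (with mpicc), and submit it with one of the batch scripts above.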
QUIZ #2 ANNOUNCED for Mon. 2/16 based on Week 2 Lecture Notes!
ASSIGNMENT #3 DUE 2/18
Rewrite the bagboy.c code to create a grocery store you’d much rather shop in!
- In the current code, customers (workers) toss random items at one poor overworked bag boy (master)
- Let’s put the customers in charge! Reverse roles so the customers now arrive in the master process
- Each of the np-1 workers is now a bag boy who will devote himself to just one customer at a time
- Rearrange the program so an idle bag boy promptly takes the next customer
- See how well the load is balanced as customers arrive with random numbers of items
- Hints:
- Switch master and worker code sections between if and else clauses
- There will need to be another layer of messages from the bag boys to the checkout line (master)
- The reason for these messages is so the master can tell which bag boys are not busy
- See if the master can make use of MPI_Irecv and MPI_Waitany (may not be the easiest way; one possible shape is sketched after these hints)
- The bag boys will also need some way to know that there are no more customers in line
- Note: the number of customers no longer needs to equal the number of processes
- As a starting point, you may use the master-worker skeleton code
we developed in class
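As mentioned in the hints, here is one possible shape for the reversed program, sketched under the assumption that each bag boy announces he is idle by sending his rank to the master; DONE_TAG, NCUSTOMERS, and the message layout are illustrative assumptions, not the in-class skeleton.

/* Sketch: customers arrive at the master (rank 0); each idle bag boy
   (ranks 1..np-1) sends an "I'm idle" message, and the master uses
   MPI_Irecv + MPI_Waitany to hand the next customer to whichever bag
   boy is free.  Run with at least 2 processes.  All names here are
   illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NCUSTOMERS 20
#define DONE_TAG 99               /* hypothetical "go home" sentinel */

int main(int argc, char *argv[])
{
    int np, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {              /* master: the checkout line */
        MPI_Request req[np - 1];  /* C99 VLA: one pending Irecv per bag boy */
        int idle[np - 1];
        int c, w, idx, items, stop = 0;

        srand((unsigned)time(NULL));
        /* Post a nonblocking receive from every bag boy. */
        for (w = 1; w < np; w++)
            MPI_Irecv(&idle[w - 1], 1, MPI_INT, w, 0, MPI_COMM_WORLD,
                      &req[w - 1]);

        for (c = 0; c < NCUSTOMERS; c++) {
            /* Block until some bag boy reports idle... */
            MPI_Waitany(np - 1, req, &idx, MPI_STATUS_IGNORE);
            items = rand() % 20 + 1;          /* random number of items */
            /* ...hand him the next customer... */
            MPI_Send(&items, 1, MPI_INT, idx + 1, 0, MPI_COMM_WORLD);
            /* ...and repost the receive so we hear from him again. */
            MPI_Irecv(&idle[idx], 1, MPI_INT, idx + 1, 0, MPI_COMM_WORLD,
                      &req[idx]);
        }
        /* Drain the outstanding receives, telling each bag boy to go home. */
        for (w = 1; w < np; w++) {
            MPI_Waitany(np - 1, req, &idx, MPI_STATUS_IGNORE);
            MPI_Send(&stop, 1, MPI_INT, idx + 1, DONE_TAG, MPI_COMM_WORLD);
        }
    } else {                      /* worker: one bag boy */
        int items, bagged = 0;
        MPI_Status st;
        for (;;) {
            MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD); /* "I'm idle" */
            MPI_Recv(&items, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == DONE_TAG) break;
            bagged += items;      /* "bag" this customer's items */
        }
        printf("bag boy %d bagged %d items\n", rank, bagged);
    }
    MPI_Finalize();
    return 0;
}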
PROJECT MILESTONE #2 DUE 2/20
Present a parallel program design that will solve the problem you described in your proposal:
- Use Foster’s design principles to analyze your problem and identify sources of parallelism
- Describe the basic type of parallelism that emerges from applying the principles
- Some things to remember as you discuss your design:
- A task to fetch 10 buckets of water is better thought of as 10 tasks to fetch 1 bucket of water
- Buckets are like data; it’s only functional parallelism if, for example, another worker mops simultaneously
- Load balancing should be addressed in any parallel program design—how will you do it?
- For master-worker codes, there is always the question, will the master be a bottleneck?
ASSIGNMENT #4 DUE 3/1
Write serial and parallel versions of a code to find prime numbers using the “Sieve of Eratosthenes”:
- You can use the following algorithm and Perl script as guides:
sieve.pdf, sieve.pl.txt
- Parallelize your serial “Sieve of Eratosthenes” code with OpenMP (a minimal sketch follows this list)
- Email me your codes along with a short explanation of your parallel design
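As promised, here is a minimal sketch of one way to attack this, assuming a simple byte-per-candidate array; sieve.pdf remains the authoritative guide, and N is just an illustrative size.

/* Sketch: serial Sieve of Eratosthenes, made parallel with one OpenMP
   pragma on the striking loop.  N is an illustrative size only. */
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(void)
{
    char *composite = calloc(N + 1, 1);   /* 0 means "assumed prime" */
    long i, j, count = 0;

    for (i = 2; i * i <= N; i++) {
        if (composite[i]) continue;
        /* Striking the multiples of i is independent work, so the
           iterations can be divided among threads. */
        #pragma omp parallel for
        for (j = i * i; j <= N; j += i)
            composite[j] = 1;
    }
    for (i = 2; i <= N; i++)
        if (!composite[i]) count++;
    printf("%ld primes up to %d\n", count, N);
    free(composite);
    return 0;
}

Compiled without the OpenMP flag, the pragma is ignored and you get the serial version; compiled with it (e.g., -fopenmp for gcc), the striking loop is shared among threads.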
PROJECT MILESTONE #3 DUE 3/6 - no late penalty if handed in by 3/13
Write a program based on the parallel algorithm design you created for your problem:
- You can use MPI or OpenMP in C or Fortran—or your solution can even be script-based
- Create a simple performance model showing where your solution generates parallel speedup
- Use the model to estimate how much parallel speedup you should get compared to a serial program
- Include timing calls on different code sections, as appropriate to your performance model (a minimal timing sketch follows this list)
- Submitted C or Fortran codes should compile with no compiler errors or warnings
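Regarding the timing calls, here is a minimal sketch using omp_get_wtime(); setup() and compute() are hypothetical stand-ins for your own code sections, and MPI_Wtime() plays the same role in an MPI code.

/* Sketch of section timing with omp_get_wtime().  The two functions
   below are placeholders for your own inherently serial and
   parallelizable sections. */
#include <omp.h>
#include <stdio.h>

void setup(void)   { /* ... inherently serial work ... */ }
void compute(void) { /* ... parallelizable work ... */ }

int main(void)
{
    double t0, t_serial, t_parallel;

    t0 = omp_get_wtime();
    setup();
    t_serial = omp_get_wtime() - t0;     /* inherently serial time */

    t0 = omp_get_wtime();
    compute();
    t_parallel = omp_get_wtime() - t0;   /* parallel-section time */

    printf("serial %g s, parallel %g s, total %g s\n",
           t_serial, t_parallel, t_serial + t_parallel);
    return 0;
}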
QUIZ #3 ANNOUNCED for Mon. 3/9 based on MPI and OpenMP (functional knowledge, not syntax details)
ASSIGNMENT #5 - REVISED - WAS DUE IN CLASS 3/11, now may be submitted by 3/18, no late penalty
Run tests with the improved “Sieve of Eratosthenes” codes that were used as examples in class:
- Begin with these example codes from class: omp_naive.c, omp_sieve.c, omp_quinn3.c
- The last of these incorporates the 3 improvements to omp_sieve.c suggested by Quinn:
- Deletes the even integers from the list, and doesn't even bother to store them
- Saves parallel overhead by having EACH thread calculate all the primes from 2 to sqrt(N); this is done by dividing the “parallel for” into two separate “parallel” and “for” pragmas (sketched at the end of this assignment)
- Introduces a third-level outer loop to improve the cache hit rate: it takes one cache-size clump at a time and strikes ALL multiples of { primes < sqrt(N) } from it
- Modify these codes and create batch scripts so that:
- New “wallclock” calls will measure the parallel time, inherently serial time, and total time
- For each code, timing data are generated for 1, 2, 3,... 8 threads on a v4 batch node
- Plot the times, parallel speedup, and parallel efficiency vs. thread count for each code
- Which code has the best performance? Best parallel efficiency? Is Amdahl's Law obeyed?
- Extra Credit: Separate the “parallel for” in omp_quinn3.c and measure the effect from doing that too
- Use the other OpenMP parallel construct in the code as your model
- Alternatively, measure the effects from other changes you make (good or bad) to a code
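As referenced in the list of Quinn's improvements above, here is a hedged sketch of what dividing a combined “parallel for” into separate “parallel” and “for” pragmas looks like; find_small_primes() and strike() are hypothetical stand-ins, not the actual omp_sieve.c internals. On the Amdahl's Law question, recall that a code with serial fraction f can speed up by at most 1/(f + (1-f)/p) on p threads.

/* Sketch: a combined "#pragma omp parallel for" forks threads and
   shares the loop in one step.  Splitting it lets each thread first
   do redundant per-thread setup (e.g., finding the small primes)
   inside one parallel region, then share the big loop, paying for
   one fork/join instead of two.  The functions are hypothetical
   stand-ins for the omp_sieve.c internals. */
#include <omp.h>

void find_small_primes(void) { /* each thread sieves 2..sqrt(N) itself */ }
void strike(long j)          { /* mark one candidate in the big array  */ }

void sieve_pass(long n)
{
    long j;
    #pragma omp parallel      /* threads are created once, here */
    {
        find_small_primes();  /* every thread repeats this small job */
        #pragma omp for       /* then the striking loop is divided up */
        for (j = 0; j < n; j++)
            strike(j);
    }                         /* single join at the end of the region */
}

int main(void)
{
    sieve_pass(1000000L);     /* illustrative size */
    return 0;
}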
Please fill out the course survey
PROJECT MILESTONE #4 DUE 3/13 - no late penalty if handed in by 3/23
Improve the parallel program you put together for your problem and demonstrate some results from it:
- Find and eliminate bugs so your code works correctly (if it doesn’t, why not?)
- Show that you're getting approximately the parallel speedup you expect (if not, why not?)
- Create a five-minute presentation about your project, to be presented in class
on Friday 3/13
- Explain your simple performance model
- Present evidence that your code solves your problem as intended
- Any slides or programs for your presentation should be submitted ahead of the class
Quick Reference Links:
MPI Documents specifying the standard MPI interface
C99 Draft Standard, plain text format
Archive of this course’s website from previous years
Online Resources:
- Chandra, Rohit, et al. Parallel Programming in OpenMP. Morgan Kaufmann Publishers, 2001. Books24x7.
- Foster, Ian. Designing and Building Parallel Programs (Online). Addison-Wesley, 1995. Argonne/CRPC.
- Petersen, Wesley, and Peter Arbenz. Introduction to Parallel Computing: A Practical Guide with Examples in C. Oxford University Press, 2004. Books24x7.
- Silva, Vladimir. Grid Computing for Developers. Cengage Charles River Media, 2006. Books24x7.
- Snir, Marc, et al. MPI: The Complete Reference. MIT, 1995. Netlib.
Reference Texts:
[1] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, 1994. ISBN 0-262-57104-8.
- W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, 1999.
- W. Gropp, E. Lusk, and R. Thakur, Using MPI-2: Advanced Features of the Message-Passing Interface, MIT Press, 1999.
[2] T. G. Lewis and H. El-Rewini, Introduction to Parallel Computing, Prentice-Hall, 1992. ISBN 0-13-498924-4.
[3] K. M. Chandy and S. Taylor, An Introduction to Parallel Programming, Jones and Bartlett, 1992. ISBN 0-86720-208-4.
[4] I. Foster, Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Addison-Wesley, 1995. ISBN 0-201-57594-9.
[5] J. M. Ortega, Introduction to Parallel and Vector Solution of Linear Systems, Plenum Press, 1988. ISBN 0-306-42862-8.
[6] S. G. Akl, Parallel Sorting Algorithms, Academic Press, 1995. ISBN 0-12-047680-0.
[7] J. Patterson, T. Disz, R. Butler, R. Stevens, R. Overbeek, B. Glickfeld, E. Lusk, and J. Boyle, Portable Programs for Parallel Processors, Holt, Rinehart and Winston, 1987. ISBN 0-03-014404-3.
Questions? Comments? Email me,
slantz@cac.cornell.edu
-- The background color for this page is RGB #caccac (what else?) --
Last updated 3/11/09