Cornell University
Computing and Information Science 4205:
Effective Use of High Performance Computing
Steve Lantz, Instructor for Spring 2009
Course Location and Time:
Upson 205, MWF 10:10–11:00
|
Duration:
8 weeks, Jan. 19–Mar. 13
Credits:
2 Credit Hours, student option
Successful/Unsuccessful or Letter Grade
Prerequisites:
Proficiency in C, C++,
Fortran, or Fortran 90/95
Recommended Text:
Parallel Programming in C with MPI and OpenMP, by
Michael J. Quinn, on reserve at the Engineering Library
(preview available on Google Book Search)
Office Hours:
533 Rhodes Hall, by appointment
Send email to slantz@cac.cornell.edu
Grading and Policies: see the separate page on grading and policies
Synopsis:
A hands-on introduction to high-performance computing (HPC) for graduate students or advanced
undergraduate students who will use HPC as a tool in their work. Various HPC architectural
platforms are described with a focus on computational clusters. Students learn how to identify and
exploit parallelism in algorithms and legacy applications; how to measure parallel speedup and
efficiency; and how to diagnose bottlenecks affecting performance. Parallel programming with MPI,
OpenMP, and task-farming techniques (for Web services and grid computing) is covered in depth.
Examples and assignments are taken from typical application areas such as matrix and Monte Carlo
computations. The goal of the class is for students to gain practical HPC experience for use in
their specific fields.
Course Outline and Lecture Notes:
(*** UNDER CONSTRUCTION ***)
The outline is tentative and may be modified—students will be given a
voice in what is ultimately presented.
(Note: Lectures may contain concepts and examples from the textbook and other sources listed below.)
Week 1:
Introduction to High Performance Computing
Lecture Notes (ppt, pdf)
Week 2:
Building and Running Parallel Applications, Pt. 1;
Parallel Application Design
Lecture Notes (ppt, pdf)
Week 3:
Building and Running Parallel Applications, Pt. 2;
Distributed Memory Programming Using Basic MPI
Lecture Notes (ppt, pdf)
Week 4:
Distributed Memory Programming Using Advanced MPI
Lecture Notes (ppt, pdf)
Weeks 5 and 6:
In-Class “Guerrilla” Development of MPI Examples
Shared Memory Programming with Basic and Advanced OpenMP
Lecture Notes (ppt, pdf)
Weeks 7 and 8:
Parallel Performance
Lecture Notes (ppt, pdf)
Sample Codes:
Week 2:
helloworld.c,
heat2d_ser.c,
heatmovie.gif
Week 3:
api_examples.c,
roundtrip.c
Remove .txt from the filename to use any of the following shell scripts:
batch_test.sh.txt,
one_node.sh.txt,
two_nodes.sh.txt
Week 4:
pi.c, gather.c,
gather2.c, alltoall.c,
tp1.c, bagboy.c,
pp1.c
Week 5:
montepi.c,
fullmonte.sh.txt,
mw_skel.c
Week 6:
omp_pi.c,
omp_firstpriv.c,
omp_lastpriv.c,
omp_single.c
Week 7:
Assignment #3 solutions: bagboys.c, bagboys_better.c
omp_montepi.c
Assignment #4 solutions: naive.c, omp_naive.c, sieve.c, omp_sieve.c
Week 8:
omp_split.c,
omp_quinn1.c,
omp_quinn2.c,
omp_quinn3.c
Quizzes, Assignments, and Project Milestones:
QUIZ #1 ANNOUNCED for Wed. 1/28 based on Week 1 Lecture Notes!
ASSIGNMENT #1 DUE 1/30
Download the gzipped tar file from Tutorial 7 at
http://www.ee.surrey.ac.uk/teaching/unix/index.html
and build the code within it on any Linux computer by following the steps delineated in 7.2–7.6. Then:
- Run the code and email me all the output you get.
- For input, use the numerical part of your Cornell NetID as the number of feet.
- Example: if your NetID is abc6, your input would be “6 feet”.
- Ask for output in meters.
- In the process, you will learn how to run configure and make!
PROJECT MILESTONE #1 DUE 2/6
Propose a problem in parallel computing that you would like to solve as an outcome of this course.
It should involve the following elements:
- Designing a parallel program (due at the end of week 5)
- Writing a proof-of-principle code (due at the end of week 7)
- Verifying that your code works (due at the end of week 8)
- It should not be so simple that you can look it up in a book
- It should not be so hard that it’s equivalent to a Ph.D. thesis project
- You will be able to seek help from me and your classmates!
- Take this as an opportunity to work on something you care about
ASSIGNMENT #2 DUE 2/11
Build the roundtrip.c example using mpicc (a sketch of the round-trip idea follows this list):
- Try to understand how this code works
- Compile, then run the executable in batch using 1 and 8 nodes of v4
- For these node counts, use -ppn to run with 8 and 64 total processes
- Change the number of megabytes sent in the code, rebuild, and rerun 4 processes on 4 nodes
- Email me the output from the above
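For reference, here is a minimal sketch of the ping-pong pattern that a round-trip timing is built on. It is an illustration only; the actual roundtrip.c may be organized differently, and names like MSG_MEGABYTES and NTRIPS are invented here rather than taken from the course code.

/* Sketch of a round-trip (ping-pong) timing between paired ranks.
   MSG_MEGABYTES and NTRIPS are illustrative names only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_MEGABYTES 1      /* the value the assignment asks you to vary */
#define NTRIPS 100

int main(int argc, char *argv[])
{
    int rank, size, i, n = MSG_MEGABYTES * 1024 * 1024;
    char *buf;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    buf = malloc(n);

    if (rank % 2 == 0 && rank + 1 < size) {      /* even ranks send first */
        t0 = MPI_Wtime();
        for (i = 0; i < NTRIPS; i++) {
            MPI_Send(buf, n, MPI_CHAR, rank + 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, n, MPI_CHAR, rank + 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
        t1 = MPI_Wtime();
        printf("rank %d: %d MB round trip averaged %g s\n",
               rank, MSG_MEGABYTES, (t1 - t0) / NTRIPS);
    } else if (rank % 2 == 1) {                  /* odd ranks echo back */
        for (i = 0; i < NTRIPS; i++) {
            MPI_Recv(buf, n, MPI_CHAR, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, n, MPI_CHAR, rank - 1, 0, MPI_COMM_WORLD);
        }
    }
    free(buf);
    MPI_Finalize();
    return 0;
}

Build it the same way you build the real code (with mpicc), and submit it with one of the batch scripts above.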
QUIZ #2 ANNOUNCED for Mon. 2/16 based on Week 2 Lecture Notes!
ASSIGNMENT #3 DUE 2/18
Rewrite the bagboy.c code to create a grocery store you’d much rather shop in!
- In the current code, customers (workers) toss random items at one poor overworked bag boy (master)
- Let’s put the customers in charge! Reverse roles so the customers now arrive in the master process
- Each of the np-1 workers is now a bag boy who will devote himself to just one customer at a time
- Rearrange the program so an idle bag boy promptly takes the next customer
- See how well the load is balanced as customers arrive with random numbers of items
- Hints:
- Switch master and worker code sections between if and else clauses
- There will need to be another layer of messages from the bag boys to the checkout line (master)
- The reason for these messages is so the master can tell which bag boys are not busy
- See if the master can make use of MPI_Irecv and MPI_Waitany (may not be the easiest way; one possible shape is sketched after these hints)
- The bag boys will also need some way to know that there are no more customers in line
- Note: the number of customers no longer needs to equal the number of processes
- As a starting point, you may use the master-worker skeleton code
we developed in class
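As mentioned in the hints, here is one possible shape for the reversed program, sketched under the assumption that each bag boy announces he is idle by sending his rank to the master; DONE_TAG, NCUSTOMERS, and the message layout are illustrative assumptions, not the in-class skeleton.

/* Sketch: customers arrive at the master (rank 0); each idle bag boy
   (ranks 1..np-1) sends an "I'm idle" message, and the master uses
   MPI_Irecv + MPI_Waitany to hand the next customer to whichever bag
   boy is free.  Run with at least 2 processes.  All names here are
   illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NCUSTOMERS 20
#define DONE_TAG 99               /* hypothetical "go home" sentinel */

int main(int argc, char *argv[])
{
    int np, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {              /* master: the checkout line */
        MPI_Request req[np - 1];  /* C99 VLA: one pending Irecv per bag boy */
        int idle[np - 1];
        int c, w, idx, items, stop = 0;

        srand((unsigned)time(NULL));
        /* Post a nonblocking receive from every bag boy. */
        for (w = 1; w < np; w++)
            MPI_Irecv(&idle[w - 1], 1, MPI_INT, w, 0, MPI_COMM_WORLD,
                      &req[w - 1]);

        for (c = 0; c < NCUSTOMERS; c++) {
            /* Block until some bag boy reports idle... */
            MPI_Waitany(np - 1, req, &idx, MPI_STATUS_IGNORE);
            items = rand() % 20 + 1;          /* random number of items */
            /* ...hand him the next customer... */
            MPI_Send(&items, 1, MPI_INT, idx + 1, 0, MPI_COMM_WORLD);
            /* ...and repost the receive so we hear from him again. */
            MPI_Irecv(&idle[idx], 1, MPI_INT, idx + 1, 0, MPI_COMM_WORLD,
                      &req[idx]);
        }
        /* Drain the outstanding receives, telling each bag boy to go home. */
        for (w = 1; w < np; w++) {
            MPI_Waitany(np - 1, req, &idx, MPI_STATUS_IGNORE);
            MPI_Send(&stop, 1, MPI_INT, idx + 1, DONE_TAG, MPI_COMM_WORLD);
        }
    } else {                      /* worker: one bag boy */
        int items, bagged = 0;
        MPI_Status st;
        for (;;) {
            MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD); /* "I'm idle" */
            MPI_Recv(&items, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == DONE_TAG) break;
            bagged += items;      /* "bag" this customer's items */
        }
        printf("bag boy %d bagged %d items\n", rank, bagged);
    }
    MPI_Finalize();
    return 0;
}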
PROJECT MILESTONE #2 DUE 2/20
Present a parallel program design that will solve the problem you described in your proposal:
- Use Foster’s design principles to analyze your problem and identify sources of parallelism
- Describe the basic type of parallelism that emerges from applying the principles
- Some things to remember as you discuss your design:
- A task to fetch 10 buckets of water is better thought of as 10 tasks to fetch 1 bucket of water
- Buckets are like data; it’s only functional parallelism if, for example, another worker mops simultaneously
- Load balancing should be addressed in any parallel program design—how will you do it?
- For master-worker codes, there is always the question, will the master be a bottleneck?
ASSIGNMENT #4 DUE 3/1
Write serial and parallel versions of a code to find prime numbers using the “Sieve of Eratosthenes”:
- You can use the following algorithm and Perl script as guides:
sieve.pdf, sieve.pl.txt
- Parallelize your serial “Sieve of Eratosthenes” code with OpenMP (a minimal sketch follows this list)
- Email me your codes along with a short explanation of your parallel design
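As promised, here is a minimal sketch of one way to attack this, assuming a simple byte-per-candidate array; sieve.pdf remains the authoritative guide, and N is just an illustrative size.

/* Sketch: serial Sieve of Eratosthenes, made parallel with one OpenMP
   pragma on the striking loop.  N is an illustrative size only. */
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(void)
{
    char *composite = calloc(N + 1, 1);   /* 0 means "assumed prime" */
    long i, j, count = 0;

    for (i = 2; i * i <= N; i++) {
        if (composite[i]) continue;
        /* Striking the multiples of i is independent work, so the
           iterations can be divided among threads. */
        #pragma omp parallel for
        for (j = i * i; j <= N; j += i)
            composite[j] = 1;
    }
    for (i = 2; i <= N; i++)
        if (!composite[i]) count++;
    printf("%ld primes up to %d\n", count, N);
    free(composite);
    return 0;
}

Compiled without the OpenMP flag, the pragma is ignored and you get the serial version; compiled with it (e.g., -fopenmp for gcc), the striking loop is shared among threads.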
PROJECT MILESTONE #3 DUE 3/6 - no late penalty if handed in by 3/13
Write a program based on the parallel algorithm design you created for your problem:
- You can use MPI or OpenMP in C or Fortran—or your solution can even be script-based
- Create a simple performance model showing where your solution generates parallel speedup
- Use the model to estimate how much parallel speedup you should get compared to a serial program
- Include timing calls on different code sections, as appropriate to your performance model (a minimal timing sketch follows this list)
- Submitted C or Fortran codes should compile with no compiler errors or warnings
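Regarding the timing calls, here is a minimal sketch using omp_get_wtime(); setup() and compute() are hypothetical stand-ins for your own code sections, and MPI_Wtime() plays the same role in an MPI code.

/* Sketch of section timing with omp_get_wtime().  The two functions
   below are placeholders for your own inherently serial and
   parallelizable sections. */
#include <omp.h>
#include <stdio.h>

void setup(void)   { /* ... inherently serial work ... */ }
void compute(void) { /* ... parallelizable work ... */ }

int main(void)
{
    double t0, t_serial, t_parallel;

    t0 = omp_get_wtime();
    setup();
    t_serial = omp_get_wtime() - t0;     /* inherently serial time */

    t0 = omp_get_wtime();
    compute();
    t_parallel = omp_get_wtime() - t0;   /* parallel-section time */

    printf("serial %g s, parallel %g s, total %g s\n",
           t_serial, t_parallel, t_serial + t_parallel);
    return 0;
}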
QUIZ #3 ANNOUNCED for Mon. 3/9 based on MPI and OpenMP (functional knowledge, not syntax details)
ASSIGNMENT #5 - REVISED - WAS DUE IN CLASS 3/11, now may be submitted by 3/18, no late penalty
Run tests with the improved “Sieve of Eratosthenes” codes that were used as examples in class:
- Begin with these example codes from class: omp_naive.c, omp_sieve.c, omp_quinn3.c
- The last of these incorporates the 3 improvements to omp_sieve.c suggested by Quinn:
- Deletes the even integers from the list, and doesn't even bother to store them
- Saves parallel overhead by having EACH thread calculate all the primes from 2 to sqrt(N); this is done by dividing the “parallel for” into two separate “parallel” and “for” pragmas (sketched at the end of this assignment)
- Introduces a third-level outer loop to improve the cache hit rate: it takes one cache-size clump at a time and strikes ALL multiples of { primes < sqrt(N) } from it
- Modify these codes and create batch scripts so that:
- New “wallclock” calls will measure the parallel time, inherently serial time, and total time
- For each code, timing data are generated for 1, 2, 3,... 8 threads on a v4 batch node
- Plot the times, parallel speedup, and parallel efficiency vs. thread count for each code
- Which code has the best performance? Best parallel efficiency? Is Amdahl's Law obeyed?
- Extra Credit: Separate the “parallel for” in omp_quinn3.c and measure the effect from doing that too
- Use the other OpenMP parallel construct in the code as your model
- Alternatively, measure the effects from other changes you make (good or bad) to a code
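As referenced in the list of Quinn's improvements above, here is a hedged sketch of what dividing a combined “parallel for” into separate “parallel” and “for” pragmas looks like; find_small_primes() and strike() are hypothetical stand-ins, not the actual omp_sieve.c internals. On the Amdahl's Law question, recall that a code with serial fraction f can speed up by at most 1/(f + (1-f)/p) on p threads.

/* Sketch: a combined "#pragma omp parallel for" forks threads and
   shares the loop in one step.  Splitting it lets each thread first
   do redundant per-thread setup (e.g., finding the small primes)
   inside one parallel region, then share the big loop, paying for
   one fork/join instead of two.  The functions are hypothetical
   stand-ins for the omp_sieve.c internals. */
#include <omp.h>

void find_small_primes(void) { /* each thread sieves 2..sqrt(N) itself */ }
void strike(long j)          { /* mark one candidate in the big array  */ }

void sieve_pass(long n)
{
    long j;
    #pragma omp parallel      /* threads are created once, here */
    {
        find_small_primes();  /* every thread repeats this small job */
        #pragma omp for       /* then the striking loop is divided up */
        for (j = 0; j < n; j++)
            strike(j);
    }                         /* single join at the end of the region */
}

int main(void)
{
    sieve_pass(1000000L);     /* illustrative size */
    return 0;
}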
Please fill out the course survey
PROJECT MILESTONE #4 DUE 3/13 - no late penalty if handed in by 3/23
Improve the parallel program you put together for your problem and demonstrate some results from it:
- Find and eliminate bugs so your code works correctly (if it doesn’t, why not?)
- Show that you're getting approximately the parallel speedup you expect (if not, why not?)
- Create a five-minute presentation about your project, to be presented in class
on Friday 3/13
- Explain your simple performance model
- Present evidence that your code solves your problem as intended
- Any slides or programs for your presentation should be submitted ahead of the class
Quick Reference Links:
MPI Documents specifying the standard MPI interface
C99 Draft Standard, plain text format
Archive of this course’s website from previous years
Online Resources:
- Chandra, Rohit, et al. Parallel Programming in OpenMP. Morgan Kaufmann Publishers, 2001. Books24x7.
- Foster, Ian. Designing and Building Parallel Programs (Online). Addison-Wesley, 1995. Argonne/CRPC.
- Petersen, Wesley, and Peter Arbenz. Introduction to Parallel Computing: A Practical Guide with Examples in C. Oxford University Press, 2004. Books24x7.
- Silva, Vladimir. Grid Computing for Developers. Cengage Charles River Media, 2006. Books24x7.
- Snir, Marc, et al. MPI: The Complete Reference. MIT, 1995. Netlib.
Reference Texts:
[1] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, 1994. ISBN 0-262-57104-8.
- W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface, MIT Press, 1999.
- W. Gropp, E. Lusk, and R. Thakur, Using MPI-2: Advanced Features of the Message-Passing Interface, MIT Press, 1999.
[2] T. G. Lewis and H. El-Rewini, Introduction to Parallel Computing, Prentice-Hall, 1992. ISBN 0-13-498924-4.
[3] K. M. Chandy and S. Taylor, An Introduction to Parallel Programming, Jones and Bartlett, 1992. ISBN 0-86720-208-4.
[4] I. Foster, Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Addison-Wesley, 1995. ISBN 0-201-57594-9.
[5] J. M. Ortega, Introduction to Parallel and Vector Solution of Linear Systems, Plenum Press, 1988. ISBN 0-306-42862-8.
[6] S. G. Akl, Parallel Sorting Algorithms, Academic Press, 1995. ISBN 0-12-047680-0.
[7] J. Patterson, T. Disz, R. Butler, R. Stevens, R. Overbeek, B. Glickfeld, E. Lusk, and J. Boyle, Portable Programs for Parallel Processors, Holt, Rinehart and Winston, 1987. ISBN 0-03-014404-3.
Questions? Comments? Email me,
slantz@cac.cornell.edu
-- The background color for this page is RGB #caccac (what else?) --
Last updated 3/11/09