GPU Teaching Kit - Accelerated Computing

Home Page

This page is the syllabus for the NVIDIA/UIUC Accelerated Computing Teaching Kit and outlines how each module is organized in the downloaded Teaching Kit .zip file. It lists the content and the associated file names for every module, as well as a link to the suggested online Deep Learning Institute (DLI) content for each module. You will also find links to stream the lecture videos (.mp4 files).

Module 1: Course Introduction
In this module we review the course goals and syllabus and introduce the concepts of heterogeneous and parallel programming.
 Lectures:
  1.1 Course Introduction and Overview
   PDF: Lecture-1-1-overview.pdf
   PPTX: Lecture-1-1-overview.pptx
   Video Lecture
  1.2 Introduction to Heterogeneous Parallel Computing
   PDF: Lecture-1-2-heterogeneous-computing.pdf
   PPTX: Lecture-1-2-heterogeneous-computing.pptx
   Video Lecture
  1.3 Portability and Scalability in Heterogeneous Parallel Computing
   PDF: Lecture-1-3-portability-scalability.pdf
   PPTX: Lecture-1-3-portability-scalability.pptx
   Video Lecture
 Book Chapters:
   Chapter 1 - Introduction: 3rd-Edition-Chapter01-introduction.pdf

Module 2: Introduction to CUDA C
In this module we cover the basic API functions in CUDA host code and introduce CUDA threads, the main mechanism for exploiting data parallelism.
 Lectures:
  2.1 CUDA C vs. Thrust vs. CUDA Libraries
   PDF: Lecture-2-1-cuda-thrust-libs.pdf
   PPTX: Lecture-2-1-cuda-thrust-libs.pptx
   Video Lecture
  2.2 Memory Allocation and Data Movement API Functions
   PDF: Lecture-2-2-cuda-data-allocation-API.pdf
   PPTX: Lecture-2-2-cuda-data-allocation-API.pptx
   Video Lecture
  2.3 Threads and Kernel Functions
   PDF: Lecture-2-3-cuda-parallelism-threads.pdf
   PPTX: Lecture-2-3-cuda-parallelism-threads.pptx
   Video Lecture
  2.4 Introduction to the CUDA Toolkit
   PDF: Lecture-2-4-cuda-toolkit.pdf
   PPTX: Lecture-2-4-cuda-toolkit.pptx
   Video Lecture
  2.5 Nsight Compute and Nsight Systems
   PDF: Lecture-2-5-nsight-systems-compute.pdf
   PPTX: Lecture-2-5-nsight-systems-compute.pptx
   Video Lecture
  2.6 Unified Memory
   PDF: Lecture-2-6-unified-memory.pdf
   PPTX: Lecture-2-6-unified-memory.pptx
   Video Lecture
 Labs:
   Device Query: Module[2]-DeviceQuery.pdf
   CUDA Toolkit: Lab-2.4.cuda-toolkit.zip
 Quiz:
   Module 2 Quiz.pdf
 Book Chapters:
   Chapter 2 - Data Parallel Computing: 3rd-Edition-Chapter02-data-parallel-computing.pdf
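 Example (illustrative sketch, not part of the kit):
   A minimal CUDA C vector addition showing the host-side allocation, transfer, kernel-launch, and cleanup calls introduced in this module; the names vecAdd and hostVecAdd are our own.
    #include <cuda_runtime.h>

    // Kernel: one thread computes one output element.
    __global__ void vecAdd(const float *A, const float *B, float *C, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) C[i] = A[i] + B[i];                        // guard against extra threads
    }

    // Host code: allocate device memory, copy inputs, launch, copy the result back.
    void hostVecAdd(const float *h_A, const float *h_B, float *h_C, int n) {
        size_t bytes = n * sizeof(float);
        float *d_A, *d_B, *d_C;
        cudaMalloc((void**)&d_A, bytes);
        cudaMalloc((void**)&d_B, bytes);
        cudaMalloc((void**)&d_C, bytes);
        cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);
        vecAdd<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);   // ceil(n/256) blocks of 256 threads
        cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    }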

Module 3: CUDA Parallelism Model
In this module we introduce the CUDA kernel, efficient memory access patterns, and thread scheduling.
 Lectures:
  3.1 Kernel-Based SPMD Parallel Programming
   PDF: Lecture-3-1-kernel-SPMD-parallelism.pdf
   PPTX: Lecture-3-1-kernel-SPMD-parallelism.pptx
   Video Lecture
  3.2 Multidimensional Kernel Configuration
   PDF: Lecture-3-2-kernel-multidimension.pdf
   PPTX: Lecture-3-2-kernel-multidimension.pptx
   Video Lecture
  3.3 Color-to-Grayscale Image Processing Example
   PDF: Lecture-3-3-color-to-greyscale-image-processing-example.pdf
   PPTX: Lecture-3-3-color-to-greyscale-image-processing-example.pptx
   Video Lecture
  3.4 Image Blur Example
   PDF: Lecture-3-4-blur-kernel.pdf
   PPTX: Lecture-3-4-blur-kernel.pptx
   Video Lecture
  3.5 Thread Scheduling
   PDF: Lecture-3-5-transparent-scaling.pdf
   PPTX: Lecture-3-5-transparent-scaling.pptx
   Video Lecture
 Labs:
   NVIDIA DLI Online Course: Fundamentals of Accelerated Computing with CUDA C/C++, Section 1: Accelerating Applications with CUDA C/C++
   CUDA Image Blur: Module[3]-ImageBlur.pdf
   CUDA Image Color to Grayscale: Module[3]-ImageColorToGrayscale.pdf
   CUDA Thrust Vector Add: Module[3]-ThrustVectorAdd.pdf
   CUDA Vector Add: Module[3]-VectorAdd.pdf
 Quiz:
   Module 3 Quiz.pdf
 Book Chapters:
   Chapter 3 - Scalable Parallel Execution: 3rd-Edition-Chapter03-scalable-parallel-execution.pdf
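 Example (illustrative sketch, not part of the kit):
   A 2D kernel configuration for the color-to-grayscale example: one thread per pixel, indexed by a two-dimensional grid of two-dimensional blocks. The names and the 16x16 block size are our choices.
    // Convert packed RGB to grayscale; each thread handles one pixel.
    __global__ void rgb2gray(const unsigned char *rgb, unsigned char *gray,
                             int width, int height) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col < width && row < height) {
            int i = row * width + col;                        // flattened pixel index
            unsigned char r = rgb[3 * i], g = rgb[3 * i + 1], b = rgb[3 * i + 2];
            gray[i] = (unsigned char)(0.21f * r + 0.71f * g + 0.07f * b);
        }
    }
    // Launch with enough 16x16 blocks to cover the whole image:
    //   dim3 block(16, 16);
    //   dim3 grid((width + 15) / 16, (height + 15) / 16);
    //   rgb2gray<<<grid, block>>>(d_rgb, d_gray, width, height);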

Module 4: Memory and Data Locality
In this module we introduce the CUDA memory types and explore their effective use in tiled parallel algorithms.
 Lectures:
  4.1 CUDA Memories
   PDF: Lecture-4-1-cuda-memories.pdf
   PPTX: Lecture-4-1-cuda-memories.pptx
   Video Lecture
  4.2 Tiled Parallel Algorithms
   PDF: Lecture-4-2-tiled-algorithms.pdf
   PPTX: Lecture-4-2-tiled-algorithms.pptx
   Video Lecture
  4.3 Tiled Matrix Multiplication
   PDF: Lecture-4-3-tiled-matrix-multiplication.pdf
   PPTX: Lecture-4-3-tiled-matrix-multiplication.pptx
   Video Lecture
  4.4 Tiled Matrix Multiplication Kernel
   PDF: Lecture-4-4-tiled-matrix-multiplication-kernel.pdf
   PPTX: Lecture-4-4-tiled-matrix-multiplication-kernel.pptx
   Video Lecture
  4.5 Handling Arbitrary Matrix Sizes in Tiled Algorithms
   PDF: Lecture-4-5-tile-boundary-condition.pdf
   PPTX: Lecture-4-5-tile-boundary-condition.pptx
   Video Lecture
 Labs:
   NVIDIA DLI Online Course: Fundamentals of Accelerated Computing with CUDA C/C++, Section 2: Managing Accelerated Application Memory with CUDA Unified Memory and nsys
   Basic Matrix Multiplication: Module[4]-BasicMatrixMultiplication.pdf
   CUDA Tiled Matrix Multiplication: Module[4]-TiledMatrixMultiplication.pdf
 Quiz:
   Module 4 Quiz.pdf
 Book Chapters:
   Chapter 4 - Memory and Data Locality: 3rd-Edition-Chapter04-memory-and-data-locality.pdf
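 Example (illustrative sketch, not part of the kit):
   A shared-memory tiled matrix multiplication kernel in the spirit of lectures 4.3-4.4, simplified by assuming the matrix width is a multiple of the tile width (lecture 4.5 covers the boundary checks needed for arbitrary sizes).
    #define TILE_WIDTH 16
    // C = A * B for square matrices; each block computes one TILE_WIDTH x TILE_WIDTH tile of C.
    __global__ void tiledMatMul(const float *A, const float *B, float *C, int width) {
        __shared__ float As[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];
        int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < width / TILE_WIDTH; ++t) {
            // Cooperatively load one tile of A and one tile of B into shared memory.
            As[threadIdx.y][threadIdx.x] = A[row * width + t * TILE_WIDTH + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * width + col];
            __syncthreads();                                  // wait until the tile is fully loaded
            for (int k = 0; k < TILE_WIDTH; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                                  // wait before overwriting the tile
        }
        C[row * width + col] = acc;
    }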

Module 5: Thread Execution Efficiency
In this module we explore how CUDA threads execute on SIMD hardware and how to analyze the performance impact of control divergence.
 Lectures:
  5.1 Warps and SIMD Hardware
   PDF: Lecture-5-1-warps-simd.pdf
   PPTX: Lecture-5-1-warps-simd.pptx
   Video Lecture
  5.2 Performance Impact of Control Divergence
   PDF: Lecture-5-2-control-divergence.pdf
   PPTX: Lecture-5-2-control-divergence.pptx
   Video Lecture
 Quiz:
   Module 5 Quiz.pdf
 Book Chapters:
   Chapter 5 - Performance Considerations: 3rd-Edition-Chapter05-performance-considerations.pdf
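 Example (illustrative sketch, not part of the kit):
   Two branch patterns that differ in control divergence. Because a warp executes in SIMD fashion, all 32 threads step through both sides of a divergent branch, with the inactive lanes masked off.
    __global__ void divergenceDemo(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Boundary check: only the single warp that straddles n diverges;
        // every other warp takes the same path, so the cost is small.
        if (i < n) data[i] *= 2.0f;
        // By contrast, a condition such as (threadIdx.x % 2 == 0) splits
        // every warp into two serialized passes, roughly halving throughput.
    }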

Module 6: Memory Access Performance
In this module we explore the importance of memory coalescing for effectively utilizing memory bandwidth in CUDA.
 Lectures:
  6.1 DRAM Bandwidth
   PDF: Lecture-6-1-dram-bandwidth.pdf
   PPTX: Lecture-6-1-dram-bandwidth.pptx
   Video Lecture
  6.2 Memory Coalescing in CUDA
   PDF: Lecture-6-2-memory-coalescing.pdf
   PPTX: Lecture-6-2-memory-coalescing.pptx
   Video Lecture
 Quiz:
   Module 6 Quiz.pdf
 Book Chapters:
   Chapter 5 - Performance Considerations: 3rd-Edition-Chapter05-performance-considerations.pdf
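 Example (illustrative sketch, not part of the kit):
   Two access patterns as seen from one warp. When consecutive threads touch consecutive addresses, the hardware coalesces the 32 requests into a few DRAM bursts; a strided pattern spreads the same requests over many bursts and wastes most of the delivered bandwidth.
    __global__ void copyCoalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];                            // coalesced: thread i reads element i
    }
    __global__ void copyStrided(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Strided: neighboring threads read elements far apart (assumes in has n*stride elements).
        if (i < n) out[i] = in[(size_t)i * stride];
    }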

Module 7: Parallel Computation Patterns (Histogram)
In this module we introduce the parallel histogram computation pattern and learn to write a high-performance kernel by privatizing outputs.
 Lectures:
  7.1 Histogramming
   PDF: Lecture-7-1-histogram.pdf
   PPTX: Lecture-7-1-histogram.pptx
   Video Lecture
  7.2 Introduction to Data Races
   PDF: Lecture-7-2-data-race.pdf
   PPTX: Lecture-7-2-data-race.pptx
   Video Lecture
  7.3 Atomic Operations in CUDA
   PDF: Lecture-7-3-CUDA-Atomic.pdf
   PPTX: Lecture-7-3-CUDA-Atomic.pptx
   Video Lecture
  7.4 Atomic Operation Performance
   PDF: Lecture-7-4-atomic-performance.pdf
   PPTX: Lecture-7-4-atomic-performance.pptx
   Video Lecture
  7.5 Privatization Technique for Improved Throughput
   PDF: Lecture-7-5-privatized-histogram.pdf
   PPTX: Lecture-7-5-privatized-histogram.pptx
   Video Lecture
 Labs:
   Histogram: Module[7]-Histogram.pdf
   Text Histogram: Module[7]-TextHistogram.pdf
   Thrust Histogram Sort: Module[7]-ThrustHistogramSort.pdf
 Quiz:
   Module 7 Quiz.pdf
 Book Chapters:
Chapter 9 - Parallel Patterns: Parallel Histogram Computation: 3rd-Edition-Chapter09-parallel-histogram-conputation.pdf
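 Example (illustrative sketch, not part of the kit):
   A privatized histogram kernel in the spirit of lecture 7.5: each block accumulates its own copy of the bins in shared memory with low-contention atomics, then merges it into the global histogram once per bin.
    #define NUM_BINS 256
    __global__ void histPrivatized(const unsigned char *data, long n, unsigned int *histo) {
        __shared__ unsigned int local[NUM_BINS];
        for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
            local[b] = 0;                                     // clear the private copy
        __syncthreads();
        long i = blockIdx.x * (long)blockDim.x + threadIdx.x;
        long stride = (long)blockDim.x * gridDim.x;
        for (; i < n; i += stride)
            atomicAdd(&local[data[i]], 1u);                   // contention stays inside the block
        __syncthreads();
        for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
            atomicAdd(&histo[b], local[b]);                   // one global atomic per bin per block
    }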

Module 8: Parallel Computation Patterns (Stencil)
In this module we introduce the tiled convolution pattern. We will learn to analyze the cost and benefit of tiled parallel convolution algorithms.
 Lectures:
  8.1 Convolution
   PDF: Lecture-8-1-convolution.pdf
   PPTX: Lecture-8-1-convolution.pptx
   Video Lecture
  8.2 Tiled Convolution
   PDF: Lecture-8-2-tiled-convolution.pdf
   PPTX: Lecture-8-2-tiled-convolution.pptx
   Video Lecture
  8.3 Tile Boundary Conditions
   PDF: Lecture-8-3-tile-boundary-condition.pdf
   PPTX: Lecture-8-3-tile-boundary-condition.pptx
   Video Lecture
  8.4 Analyzing Data Reuse in Tiled Convolution
   PDF: Lecture-8-4-convolution-reuse.pdf
   PPTX: Lecture-8-4-convolution-reuse.pptx
   Video Lecture
 Labs:
   Convolution: Module[8]-Convolution.pdf
   Stencil: Module[8]-Stencil.pdf
 Quiz:
   Module 8 Quiz.pdf
 Book Chapters:
   Chapter 7 - Parallel Patterns: Convolution: 3rd-Edition-Chapter07-convolution.pdf
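 Example (illustrative sketch, not part of the kit):
   A basic 1D convolution kernel with the mask in constant memory; out-of-range neighbors ("ghost cells") contribute zero. The tiled versions in lectures 8.2-8.4 add shared-memory staging of the input.
    #define MASK_WIDTH 5
    __constant__ float M[MASK_WIDTH];            // convolution mask, set with cudaMemcpyToSymbol
    __global__ void conv1D(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float acc = 0.0f;
        int start = i - MASK_WIDTH / 2;           // center the mask on element i
        for (int j = 0; j < MASK_WIDTH; ++j) {
            int idx = start + j;
            if (idx >= 0 && idx < n) acc += in[idx] * M[j];   // ghost cells contribute 0
        }
        out[i] = acc;
    }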

Module 9: Parallel Computation Patterns (Reduction)
In this module we introduce the parallel reduction pattern.
 Lectures:
  9.1 Parallel Reduction
   PDF: Lecture-9-1-reduction.pdf
   PPTX: Lecture-9-1-reduction.pptx
   Video Lecture
  9.2 A Basic Reduction Kernel
   PDF: Lecture-9-2-reduction-kernel.pdf
   PPTX: Lecture-9-2-reduction-kernel.pptx
   Video Lecture
  9.3 A Better Reduction Kernel
   PDF: Lecture-9-3-better-reduction-kernel.pdf
   PPTX: Lecture-9-3-better-reduction-kernel.pptx
   Video Lecture
 Labs:
   Reduction: Module[9]-Reduction.pdf
   Thrust Reduction: Module[9]-ThrustReduction.pdf
 Quiz:
   Module 9 Quiz.pdf
 Book Chapters:
   Chapter 5 - Performance Considerations: 3rd-Edition-Chapter05-performance-considerations.pdf
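 Example (illustrative sketch, not part of the kit):
   A shared-memory sum reduction along the lines of lecture 9.3: each block reduces 2*BLOCK_SIZE elements, halving the stride each step so active threads stay packed in consecutive warps. A second pass (or a final host-side sum) combines the per-block partial sums.
    #define BLOCK_SIZE 256
    __global__ void reduceSum(const float *in, float *partialSums, int n) {
        __shared__ float sdata[BLOCK_SIZE];
        int tid = threadIdx.x;
        int start = blockIdx.x * BLOCK_SIZE * 2;              // each block covers 2*BLOCK_SIZE inputs
        float x = 0.0f;
        if (start + tid < n)              x  = in[start + tid];
        if (start + BLOCK_SIZE + tid < n) x += in[start + BLOCK_SIZE + tid];
        sdata[tid] = x;
        for (int stride = BLOCK_SIZE / 2; stride > 0; stride /= 2) {
            __syncthreads();                                  // previous step has finished
            if (tid < stride) sdata[tid] += sdata[tid + stride];
        }
        if (tid == 0) partialSums[blockIdx.x] = sdata[0];
    }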

Module 10: Parallel Computation Patterns (Scan)
In this module we introduce the parallel scan (prefix sum) pattern.
 Lectures:
  10.1 Prefix Sum
   PDF: Lecture-10-1-scan-parallel-prefix-sum.pdf
   PPTX: Lecture-10-1-scan-parallel-prefix-sum.pptx
   Video Lecture
  10.2 A Work-Inefficient Scan Kernel
   PDF: Lecture-10-2-work-inefficient-scan-kernel.pdf
   PPTX: Lecture-10-2-work-inefficient-scan-kernel.pptx
   Video Lecture
  10.3 A Work-Efficient Parallel Scan Kernel
   PDF: Lecture-10-3-work-efficient-scan-kernel.pdf
   PPTX: Lecture-10-3-work-efficient-scan-kernel.pptx
   Video Lecture
  10.4 More on Parallel Scan
   PDF: Lecture-10-4-more-on-parallel-scan.pdf
   PPTX: Lecture-10-4-more-on-parallel-scan.pptx
   Video Lecture
 Labs:
   List Scan: Module[10]-ListScan.pdf
Thrust List Scan: Module[10]-ThrustListScan.pdf
 Quiz:
   Module 10 Quiz.pdf
 Book Chapters:
Chapter 8 - Parallel Patterns: Prefix Sum: 3rd-Edition-Chapter08-prefix-sum.pdf
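 Example (illustrative sketch, not part of the kit):
   A simple, work-inefficient Kogge-Stone style inclusive scan of one block-sized section, as in lecture 10.2 (launch with SECTION_SIZE threads per block); combining the per-section results and the work-efficient variant are covered in lectures 10.3-10.4.
    #define SECTION_SIZE 256
    __global__ void scanKoggeStone(const float *in, float *out, int n) {
        __shared__ float buf[SECTION_SIZE];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        for (int stride = 1; stride < blockDim.x; stride *= 2) {
            __syncthreads();
            // Read before write, with a second barrier, to avoid a race on buf.
            float val = (threadIdx.x >= stride) ? buf[threadIdx.x - stride] : 0.0f;
            __syncthreads();
            buf[threadIdx.x] += val;
        }
        if (i < n) out[i] = buf[threadIdx.x];                 // inclusive prefix sum of this section
    }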

Module 11: Breadth-First Search (BFS) Queue
In this module we cover the breadth-first search (BFS) queue.
 Labs:
   Breadth-First Search Queue: Module[11]-BfsQueue.pdf
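 Example (illustrative sketch, not part of the kit):
   One level of frontier-based BFS on a CSR graph: each thread expands one frontier vertex and appends newly visited neighbors to the next-level queue using atomics. A common refinement adds block-level (privatized) queues to reduce contention. All names here are our own.
    __global__ void bfsLevel(const int *rowPtr, const int *colIdx,
                             const int *frontier, int frontierSize,
                             int *nextFrontier, int *nextSize, int *visited) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= frontierSize) return;
        int v = frontier[t];
        for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {     // neighbors of v
            int u = colIdx[e];
            if (atomicExch(&visited[u], 1) == 0) {            // first visitor claims u
                int pos = atomicAdd(nextSize, 1);             // reserve a slot in the next frontier
                nextFrontier[pos] = u;
            }
        }
    }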

Module 12: Floating-Point Considerations
In this module we introduce the fundamentals of floating-point representation.
 Lectures:
  12.1 Floating-Point Precision and Accuracy
   PDF: Lecture-12-1-floating-point-basics.pdf
   PPTX: Lecture-12-1-floating-point-basics.pptx
   Video Lecture
  12.2 Numerical Stability
   PDF: Lecture-12-2-numerical-stability.pdf
   PPTX: Lecture-12-2-numerical-stability.pptx
   Video Lecture
 Book Chapters:
   Chapter 6 - Numerical Considerations: 3rd-Edition-Chapter06-numerical-considerations.pdf
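 Example (illustrative sketch, not part of the kit):
   A host-side illustration of why accumulation order matters with finite precision: single-precision floats carry roughly 7 decimal digits, so small terms added one at a time to a large running sum can be lost entirely.
    #include <cstdio>
    int main() {
        float big = 1.0e8f, small = 1.0f;
        float a = big;
        for (int i = 0; i < 100; ++i) a += small;     // each 1.0f is rounded away against 1.0e8f
        float b = 0.0f;
        for (int i = 0; i < 100; ++i) b += small;     // sum the small terms first
        b += big;
        printf("a = %.1f  b = %.1f\n", a, b);         // a loses the 100 small terms; b keeps (most of) them
        return 0;
    }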

Module 13: GPU as Part of the PC Architecture
In this module we introduce how GPUs fit in the PC architecture.
 Lectures:
  13.1 GPU as Part of the PC Architecture
   PDF: Lecture-13-GPU-in-PC-Architecture.pdf
   PPTX: Lecture-13-GPU-in-PC-Architecture.pptx
   Video Lecture
 Book Chapters:
   Chapter 18 - Programming a Heterogeneous Computing Cluster: 3rd-Edition-Chapter18-heterogeneous-cluster.pdf
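 Example (illustrative sketch, not part of the kit):
   Each GPU is a device on the PCIe fabric; the CUDA runtime reports where it sits on the bus and how much device memory it has.
    #include <cstdio>
    #include <cuda_runtime.h>
    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, d);
            printf("GPU %d: %s  PCI %04x:%02x:%02x  %.1f GiB global memory\n",
                   d, p.name, p.pciDomainID, p.pciBusID, p.pciDeviceID,
                   p.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        }
        return 0;
    }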

Module 14: Efficient Host-Device Data Transfer
In this module we discuss important concepts involved in copying (transferring) data between host and device.
 Lectures:
  14.1 Pinned Host Memory
   PDF: Lecture-14-1-Data-Transfer.pdf
   PPTX: Lecture-14-1-Data-Transfer.pptx
   Video Lecture
  14.2 Task Parallelism in CUDA
   PDF: Lecture-14-2-CUDA-Streams.pdf
   PPTX: Lecture-14-2-CUDA-Streams.pptx
   Video Lecture
  14.3 Overlapping Data Transfer with Computation
   PDF: Lecture-14-3-Overlap-Transfer.pdf
   PPTX: Lecture-14-3-Overlap-Transfer.pptx
   Video Lecture
  14.4 CUDA Unified Memory
   PDF: Lecture-14-4-cuda-unified-memory.pdf
   PPTX: Lecture-14-4-cuda-unified-memory.pptx
   Video Lecture
 Labs:
   NVIDIA DLI Online Course: Fundamentals of Accelerated Computing with CUDA C/C++, Section 3: Asynchronous Streaming, and Visual Profiling for Accelerated Applications with CUDA C/C++
   Vector Addition Using CUDA Streams: Module[14]-VectorAdd_Stream.pdf
   Vector Addition Using Pinned Memory: Module[14]-PinnedMemoryStreamsVectorAdd.pdf
   CUDA Unified Memory Matrix Multiplication: Module[14]-UMMatrixMultiplication.pdf
 Quiz:
   Module 14 Quiz.pdf
 Book Chapters:
   Chapter 18 - Programming a Heterogeneous Computing Cluster: 3rd-Edition-Chapter18-heterogeneous-cluster.pdf
Chapter 20 - More on CUDA and Graphics Processing Unit Computing: 3rd-Edition-Chapter20-more-cuda-gpu-computing.pdf
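 Example (illustrative sketch, not part of the kit):
   Pinned (page-locked) host buffers allow asynchronous copies, and splitting the work across streams lets the copies of one chunk overlap the kernel of another. vecAdd is the simple kernel sketched under Module 2; initialization, chunking, and error checks are simplified, and the names are our own.
    void streamedVecAdd(float *d_A, float *d_B, float *d_C, int n) {
        const int nStreams = 4;
        int chunk = n / nStreams;                             // assume n divides evenly
        size_t cbytes = chunk * sizeof(float);
        float *h_A, *h_B, *h_C;
        cudaMallocHost((void**)&h_A, n * sizeof(float));      // page-locked host memory
        cudaMallocHost((void**)&h_B, n * sizeof(float));
        cudaMallocHost((void**)&h_C, n * sizeof(float));
        /* ... fill h_A and h_B with input data here ... */
        cudaStream_t st[nStreams];
        for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&st[s]);
        for (int s = 0; s < nStreams; ++s) {
            int off = s * chunk;
            cudaMemcpyAsync(d_A + off, h_A + off, cbytes, cudaMemcpyHostToDevice, st[s]);
            cudaMemcpyAsync(d_B + off, h_B + off, cbytes, cudaMemcpyHostToDevice, st[s]);
            vecAdd<<<(chunk + 255) / 256, 256, 0, st[s]>>>(d_A + off, d_B + off, d_C + off, chunk);
            cudaMemcpyAsync(h_C + off, d_C + off, cbytes, cudaMemcpyDeviceToHost, st[s]);
        }
        cudaDeviceSynchronize();                              // wait for all streams to finish
    }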

Module 15: Application Case Study: Advanced MRI Reconstruction
In this module we introduce the MRI Reconstruction case study.
 Lectures:
  15.1 Advanced MRI Reconstruction
   PDF: Lecture-15-1-MRI-reconstruction.pdf
   PPTX: Lecture-15-1-MRI-reconstruction.pptx
   Video Lecture
  15.2 Kernel Optimizations
   PDF: Lecture-15-2-MRI-kernel-optimization.pdf
   PPTX: Lecture-15-2-MRI-kernel-optimization.pptx
   Video Lecture
 Book Chapters:
   Chapter 14 - Application Case Study - Non-Cartesian Magnetic Resonance Imaging: 3rd-Edition-Chapter14-case-study-MRI.pdf

Module 16: Application Case Study: Electrostatic Potential Calculation
In this module we introduce the Electrostatic Potential Calculation case study.
 Lectures:
  16.1 Electrostatic Potential Calculation - Part 1
   PDF: Lecture-16-1-VMD-case-study-Part1.pdf
   PPTX: Lecture-16-1-VMD-case-study-Part1.pptx
   Video Lecture
  16.2 Electrostatic Potential Calculation - Part 2
   PDF: Lecture-16-2-VMD-case-study-Part2.pdf
   PPTX: Lecture-16-2-VMD-case-study-Part2.pptx
   Video Lecture

Module 17: Computational Thinking for Parallel Programming
In this module we provide a framework for thinking about the problems of parallel programming.
 Lectures:
  17.1 Introduction to Computational Thinking
   PDF: Lecture-17-1-Computational-Thinking.pdf
   PPTX: Lecture-17-1-Computational-Thinking.pptx
 Book Chapters:
   Chapter 17 - Parallel Programming and Computational Thinking: 3rd-Edition-Chapter17-computational-thinking.pdf

Module 18: Related Programming Models: MPI
In this module we introduce the MPI programming model.
 Lectures:
  18.1 Introduction to Heterogeneous Supercomputing and MPI
   PDF: Lecture-18-MPI-CUDA-intro.pdf
   PPTX: Lecture-18-MPI-CUDA-intro.pptx
 Book Chapters:
   Chapter 18 - Programming a Heterogeneous Computing Cluster: 3rd-Edition-Chapter18-heterogeneous-cluster.pdf

Module 19: CUDA Python using Numba
In this module we introduce CUDA Python using Numba.
 Labs:
   NVIDIA DLI Online Course: Fundamentals of Accelerated Computing with CUDA Python

Module 20: Related Programming Models: OpenCL
In this module we introduce the OpenCL programming model.
 Lectures:
  20.1 OpenCL Data Parallelism Model
   PDF: Lecture-20-1-opencl-parallelism.pdf
   PPTX: Lecture-20-1-opencl-parallelism.pptx
  20.2 OpenCL Device Architecture
   PDF: Lecture-20-2-opencl-architecture.pdf
   PPTX: Lecture-20-2-opencl-architecture.pptx
  20.3 OpenCL Host Code
   PDF: Lecture-20-3-opencl-host-code.pdf
   PPTX: Lecture-20-3-opencl-host-code.pptx
 Labs:
   OpenCL Vector Addition: Module[20]-OpenCLVectorAddition.pdf
 Quiz:
   Module 20 Quiz.pdf
 Book Chapters:
   Appendix - An Introduction to OpenCL: 3rd-Edition-AppendixA-intro-to-OpenCL.pdf

Module 21: Related Programming Models: OpenACC
In this module we introduce the OpenACC programming model.
 Lectures:
  21.1 Introduction to OpenACC
   PDF: Lecture-21-1-openACC-intro.pdf
   PPTX: Lecture-21-1-openACC-intro.pptx
   Video Lecture
  21.2 OpenACC Subtleties
   PDF: Lecture-21-2-openACC-subtleties.pdf
   PPTX: Lecture-21-2-openACC-subtleties.pptx
   Video Lecture
 Labs:
   NVIDIA DLI Online Course: Fundamentals of Accelerated Computing with OpenACC
   OpenACC CUDA Vector Add: Module[21]-OpenACCVectorAdd.pdf
 Quiz:
   Module 21 Quiz.pdf
 Book Chapters:
Chapter 19 - Parallel Programming with OpenACC: 3rd-Edition-Chapter19-programming-with-OpenACC.pdf

Module 22: Related Programming Models: OpenGL
In this module we introduce the OpenGL programming model.
(Module scheduled for a future release of the Teaching Kit.)

Module 23: Dynamic Parallelism
In this module we introduce dynamic parallelism.
 Lectures:
  23.1 Dynamic Parallelism
   PDF: Lecture-23-Dynamic-parallelism.pdf
   PPTX: Lecture-23-Dynamic-parallelism.pptx
   Video Lecture
 Labs:
   Dynamic Parallelism: Module[23]-DynamicParallelism.pdf
 Book Chapters:
Chapter 13 - CUDA Dynamic Parallelism: 3rd-Edition-Chapter13-cuda-dynamic-parallelism.pdf
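 Example (illustrative sketch, not part of the kit):
   With dynamic parallelism a kernel can launch other kernels directly from the device; compile with nvcc -rdc=true and link the device runtime (-lcudadevrt). The kernel names are our own.
    __global__ void childKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }
    __global__ void parentKernel(float *data, int n) {
        if (blockIdx.x == 0 && threadIdx.x == 0)              // launch once, from a single thread
            childKernel<<<(n + 255) / 256, 256>>>(data, n);   // device-side kernel launch
    }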

Module 24: Multi-GPU
In this module we discuss programming with multiple GPUs.
 Lectures:
  24.1 OpenMP
   PDF: Lecture-24-1-openmp.pdf
   PPTX: Lecture-24-1-openmp.pptx
  24.2 Multi-GPU Introduction I
   PDF: Lecture-24-2-multi-gpu-introduction-i.pdf
   PPTX: Lecture-24-2-multi-gpu-introduction-i.pptx
  24.3 Multi-GPU Introduction II
   PDF: Lecture-24-3-multi-gpu-introduction-ii.pdf
   PPTX: Lecture-24-3-multi-gpu-introduction-ii.pptx
  24.4 OpenMP and Cooperative Groups
   PDF: Lecture-24-4-openmp-and-cooperative-groups.pdf
   PPTX: Lecture-24-4-openmp-and-cooperative-groups.pptx
  24.5 Multi-GPU Heat Equation
   PDF: Lecture-24-5-multi-gpu-heat-equation.pdf
   PPTX: Lecture-24-5-multi-gpu-heat-equation.pptx
 Labs:
   Multi-GPU Heat Equation: Module[24]-HeatEquation.pdf
 Quiz:
   Module 24 Quiz.pdf
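 Example (illustrative sketch, not part of the kit):
   One OpenMP host thread per GPU: each thread binds to its device with cudaSetDevice and launches work on that device's share of the data (assuming d_chunks[dev] was allocated on device dev beforehand). Names are our own; compile with nvcc -Xcompiler -fopenmp.
    #include <omp.h>
    __global__ void scaleKernel(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }
    void multiGpuScale(float **d_chunks, int chunkLen, int nGpus) {
        #pragma omp parallel num_threads(nGpus)
        {
            int dev = omp_get_thread_num();
            cudaSetDevice(dev);                               // bind this host thread to GPU dev
            scaleKernel<<<(chunkLen + 255) / 256, 256>>>(d_chunks[dev], chunkLen);
            cudaDeviceSynchronize();                          // wait for this GPU's kernel
        }
    }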

Module 25: Using CUDA Libraries
In this module we introduce the effective use of CUDA libraries.
 Lectures:
  25.1 cuBLAS
   PDF: Lecture-25-1-cublas.pdf
   PPTX: Lecture-25-1-cublas.pptx
  25.2 cuSOLVER
   PDF: Lecture-25-2-cusolver.pdf
   PPTX: Lecture-25-2-cusolver.pptx
  25.3 cuFFT
   PDF: Lecture-25-3-cufft.pdf
   PPTX: Lecture-25-3-cufft.pptx
  25.4 Thrust
   PDF: Lecture-25-4-thrust.pdf
   PPTX: Lecture-25-4-thrust.pptx
 Labs:
   Heat Equation with NVIDIA libraries: Module[25]-HeatEquationLibs.pdf
 Quiz:
   Module 25 Quiz.pdf
 Book Chapters:
Appendix B - THRUST: A Productivity-Oriented Library for CUDA: 3rd-Edition-AppendixB-thrust.pdf
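 Example (illustrative sketch, not part of the kit):
   Thrust hides allocation, transfer, and kernel launches behind STL-style containers and algorithms, whereas cuBLAS, cuSOLVER, and cuFFT use handle-based C APIs.
    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    #include <thrust/reduce.h>
    #include <cstdio>
    int main() {
        thrust::device_vector<float> v(1 << 20);              // storage lives in device memory
        thrust::sequence(v.begin(), v.end());                 // fill with 0, 1, 2, ...
        float sum = thrust::reduce(v.begin(), v.end(), 0.0f); // parallel reduction on the GPU
        printf("sum = %.0f\n", sum);
        return 0;
    }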

Module 26: Advanced Thrust
In this module we discuss advanced Thrust topics.
(Module scheduled for a future release of the Teaching Kit.)