Module 1: Course Introduction |
In this module we review course goals and syllabus and introduce the concepts of heterogeneous and parallel programming. |
  | Lectures: |
  |   | 1.1 Course Introduction and Overview |
  |   |   |  | pdf | Lecture-1-1-overview.pdf |
  |   |   |  | pptx | Lecture-1-1-overview.pptx |
  |   |   | | | Video Lecture |
  |   | 1.2 Introduction to Heterogeneous Parallel Computing |
  |   |   |  | pdf | Lecture-1-2-heterogeneous-computing.pdf |
  |   |   |  | pptx | Lecture-1-2-heterogeneous-computing.pptx |
  |   |   | | | Video Lecture |
  |   | 1.3 Portability and Scalability in Heterogeneous Parallel Computing |
  |   |   |  | pdf | Lecture-1-3-portability-scalability.pdf |
  |   |   |  | pptx | Lecture-1-3-portability-scalability.pptx |
  |   |   | | | Video Lecture |
  | Book Chapters: |
  |   |   |  | | Chapter 1 - Introduction: 3rd-Edition-Chapter01-introduction.pdf |
|
Module 2: Introduction to CUDA C |
In this module we cover the basic API functions in CUDA host code and introduce CUDA threads, the main mechanism for exploiting data parallelism. |
  | Lectures: |
  |   | 2.1 CUDA C vs. Thrust vs. CUDA Libraries |
  |   |   |  | pdf | Lecture-2-1-cuda-thrust-libs.pdf |
  |   |   |  | pptx | Lecture-2-1-cuda-thrust-libs.pptx |
  |   |   | | | Video Lecture |
  |   | 2.2 Memory Allocation and Data Movement API Functions |
  |   |   |  | pdf | Lecture-2-2-cuda-data-allocation-API.pdf |
  |   |   |  | pptx | Lecture-2-2-cuda-data-allocation-API.pptx |
  |   |   | | | Video Lecture |
  |   | 2.3 Threads and Kernel Functions |
  |   |   |  | pdf | Lecture-2-3-cuda-parallelism-threads.pdf |
  |   |   |  | pptx | Lecture-2-3-cuda-parallelism-threads.pptx |
  |   |   | | | Video Lecture |
  |   | 2.4 Introduction to the CUDA Toolkit |
  |   |   |  | pdf | Lecture-2-4-cuda-toolkit.pdf |
  |   |   |  | pptx | Lecture-2-4-cuda-toolkit.pptx |
  |   |   | | | Video Lecture |
  |   | 2.5 Nsight Compute and Nsight Systems |
  |   |   |  | pdf | Lecture-2-5-nsight-systems-compute.pdf |
  |   |   |  | pptx | Lecture-2-5-nsight-systems-compute.pptx |
  |   |   | | | Video Lecture |
  |   | 2.6 Unified Memory |
  |   |   |  | pdf | Lecture-2-6-unified-memory.pdf |
  |   |   |  | pptx | Lecture-2-6-unified-memory.pptx |
  |   |   | | | Video Lecture |
  | Labs: |
  |   |   |  | | Device Query: Module[2]-DeviceQuery.pdf |
  |   |   |  | | CUDA Toolkit: Lab-2.4.cuda-toolkit.zip |
  | Quiz: |
  |   |   |  | | Module 2 Quiz.pdf |
  | Book Chapters: |
  |   |   |  | | Chapter 2 - Data Parallel Computing: 3rd-Edition-Chapter02-data-parallel-computing.pdf |
|
Module 3: CUDA Parallelism Model |
In this module we introduce the CUDA kernel, efficient memory access patterns, and thread scheduling. |
  | Lectures: |
  |   | 3.1 Kernel-Based SPMD Parallel Programming |
  |   |   |  | pdf | Lecture-3-1-kernel-SPMD-parallelism.pdf |
  |   |   |  | pptx | Lecture-3-1-kernel-SPMD-parallelism.pptx |
  |   |   | | | Video Lecture |
  |   | 3.2 Multidimensional Kernel Configuration |
  |   |   |  | pdf | Lecture-3-2-kernel-multidimension.pdf |
  |   |   |  | pptx | Lecture-3-2-kernel-multidimension.pptx |
  |   |   | | | Video Lecture |
  |   | 3.3 Color-to-Grayscale Image Processing Example |
  |   |   |  | pdf | Lecture-3-3-color-to-greyscale-image-processing-example.pdf |
  |   |   |  | pptx | Lecture-3-3-color-to-greyscale-image-processing-example.pptx |
  |   |   | | | Video Lecture |
  |   | 3.4 Image Blur Example |
  |   |   |  | pdf | Lecture-3-4-blur-kernel.pdf |
  |   |   |  | pptx | Lecture-3-4-blur-kernel.pptx |
  |   |   | | | Video Lecture |
  |   | 3.5 Thread Scheduling |
  |   |   |  | pdf | Lecture-3-5-transparent-scaling.pdf |
  |   |   |  | pptx | Lecture-3-5-transparent-scaling.pptx |
  |   |   | | | Video Lecture |
  | Labs: |
  |   |   |  | | NVIDIA DLI Online Course: Fundamentals of Accelerated Computing with CUDA C/C++, Section 1: Accelerating Applications with CUDA C/C++ |
  |   |   |  | | CUDA Image Blur: Module[3]-ImageBlur.pdf |
  |   |   |  | | CUDA Image Color to Grayscale: Module[3]-ImageColorToGrayscale.pdf |
  |   |   |  | | CUDA Thrust Vector Add: Module[3]-ThrustVectorAdd.pdf |
  |   |   |  | | CUDA Vector Add: Module[3]-VectorAdd.pdf |
  | Quiz: |
  |   |   |  | | Module 3 Quiz.pdf |
  | Book Chapters: |
  |   |   |  | | Chapter 3 - Scalable Parallel Execution: 3rd-Edition-Chapter03-scalable-parallel-execution.pdf |
|
Module 4: Memory and Data Locality |
In this module we introduce the CUDA memory types and explore their effective use in tiled parallel algorithms. |
  | Lectures: |
  |   | 4.1 CUDA Memories |
  |   |   |  | pdf | Lecture-4-1-cuda-memories.pdf |
  |   |   |  | pptx | Lecture-4-1-cuda-memories.pptx |
  |   |   | | | Video Lecture |
  |   | 4.2 Tiled Parallel Algorithms |
  |   |   |  | pdf | Lecture-4-2-tiled-algorithms.pdf |
  |   |   |  | pptx | Lecture-4-2-tiled-algorithms.pptx |
  |   |   | | | Video Lecture |
  |   | 4.3 Tiled Matrix Multiplication |
  |   |   |  | pdf | Lecture-4-3-tiled-matrix-multiplication.pdf |
  |   |   |  | pptx | Lecture-4-3-tiled-matrix-multiplication.pptx |
  |   |   | | | Video Lecture |
  |   | 4.4 Tiled Matrix Multiplication Kernel |
  |   |   |  | pdf | Lecture-4-4-tiled-matrix-multiplication-kernel.pdf |
  |   |   |  | pptx | Lecture-4-4-tiled-matrix-multiplication-kernel.pptx |
  |   |   | | | Video Lecture |
  |   | 4.5 Handling Arbitrary Matrix Sizes in Tiled Algorithms |
  |   |   |  | pdf | Lecture-4-5-tile-boundary-condition.pdf |
  |   |   |  | pptx | Lecture-4-5-tile-boundary-condition.pptx |
  |   |   | | | Video Lecture |
  | Labs: |
  |   |   |  | | NVIDIA DLI Online Course: Fundamentals of Accelerated Computing with CUDA C/C++, Section 2: Managing Accelerated Application Memory with CUDA Unified Memory and nsys |
  |   |   |  | | Basic Matrix Multiplication: Module[4]-BasicMatrixMultiplication.pdf |
  |   |   |  | | CUDA Tiled Matrix Multiplication: Module[4]-TiledMatrixMultiplication.pdf |
  | Quiz: |
  |   |   |  | | Module 4 Quiz.pdf |
  | Book Chapters: |
  |   |   |  | | Chapter 4 - Memory and Data Locality: 3rd-Edition-Chapter04-memory-and-data-locality.pdf |
|
Module 5: Thread Execution Efficiency |
In this module we explore how CUDA threads execute on SIMD hardware and how to analyze the performance impact of control divergence. |
  | Lectures: |
  |   | 5.1 Warps and SIMD Hardware |
  |   |   |  | pdf | Lecture-5-1-warps-simd.pdf |
  |   |   |  | pptx | Lecture-5-1-warps-simd.pptx |
  |   |   | | | Video Lecture |
  |   | 5.2 Performance Impact of Control Divergence |
  |   |   |  | pdf | Lecture-5-2-control-divergence.pdf |
  |   |   |  | pptx | Lecture-5-2-control-divergence.pptx |
  |   |   | | | Video Lecture |
  | Quiz: |
  |   |   |  | | Module 5 Quiz.pdf |
  | Book Chapters: |
  |   |   |  | | Chapter 5 - Performance Considerations: 3rd-Edition-Chapter05-performance-considerations.pdf |
|
Module 6: Memory Access Performance |
In this module we explore the significance of memory coalescing to effectively utilize memory bandwidth in CUDA. |
  | Lectures: |
  |   | 6.1 DRAM Bandwidth |
  |   |   |  | pdf | Lecture-6-1-dram-bandwidth.pdf |
  |   |   |  | pptx | Lecture-6-1-dram-bandwidth.pptx |
  |   |   | | | Video Lecture |
  |   | 6.2 Memory Coalescing in CUDA |
  |   |   |  | pdf | Lecture-6-2-memory-coalescing.pdf |
  |   |   |  | pptx | Lecture-6-2-memory-coalescing.pptx |
  |   |   | | | Video Lecture |
  | Quiz: |
  |   |   |  | | Module 6 Quiz.pdf |
  | Book Chapters: |
  |   |   |  | | Chapter 5 - Performance Considerations: 3rd-Edition-Chapter05-performance-considerations.pdf |
|
Module 7: Parallel Computation Patterns (Histogram) |
In this module we introduce the parallel histogram computation pattern and learn to write a high-performance kernel by privatizing outputs. |
  | Lectures: |
  |   | 7.1 Histogramming |
  |   |   |  | pdf | Lecture-7-1-histogram.pdf |
  |   |   |  | pptx | Lecture-7-1-histogram.pptx |
  |   |   | | | Video Lecture |
  |   | 7.2 Introduction to Data Races |
  |   |   |  | pdf | Lecture-7-2-data-race.pdf |
  |   |   |  | pptx | Lecture-7-2-data-race.pptx |
  |   |   | | | Video Lecture |
  |   | 7.3 Atomic Operations in CUDA |
  |   |   |  | pdf | Lecture-7-3-CUDA-Atomic.pdf |
  |   |   |  | pptx | Lecture-7-3-CUDA-Atomic.pptx |
  |   |   | | | Video Lecture |
  |   | 7.4 Atomic Operation Performance |
  |   |   |  | pdf | Lecture-7-4-atomic-performance.pdf |
  |   |   |  | pptx | Lecture-7-4-atomic-performance.pptx |
  |   |   | | | Video Lecture |
  |   | 7.5 Privatization Technique for Improved Throughput |
  |   |   |  | pdf | Lecture-7-5-privatized-histogram.pdf |
  |   |   |  | pptx | Lecture-7-5-privatized-histogram.pptx |
  |   |   | | | Video Lecture |
  | Labs: |
  |   |   |  | | Histogram: Module[7]-Histogram.pdf |
  |   |   |  | | Text Histogram: Module[7]-TextHistogram.pdf |
  |   |   |  | | Thrust Histogram Sort: Module[7]-ThrustHistogramSort.pdf |
  | Quiz: |
  |   |   |  | | Module 7 Quiz.pdf |
  | Book Chapters: |
  |   |   |  | | Chapter 9 - Parallel Patterns: Parallel Histogram Computation: 3rd-Edition-Chapter09-parallel-histogram-conputation.pdf |
|
Module 8: Parallel Computation Patterns (Stencil) |
In this module we introduce the tiled convolution pattern. We will learn to analyze the cost and benefit of tiled parallel convolution algorithms. |
  | Lectures: |
  |   | 8.1 Convolution |
  |   |   |  | pdf | Lecture-8-1-convolution.pdf |
  |   |   |  | pptx | Lecture-8-1-convolution.pptx |
  |   |   | | | Video Lecture |
  |   | 8.2 Tiled Convolution |
  |   |   |  | pdf | Lecture-8-2-tiled-convolution.pdf |
  |   |   |  | pptx | Lecture-8-2-tiled-convolution.pptx |
  |   |   | | | Video Lecture |
  |   | 8.3 Tile Boundary Conditions |
  |   |   |  | pdf | Lecture-8-3-tile-boundary-condition.pdf |
  |   |   |  | pptx | Lecture-8-3-tile-boundary-condition.pptx |
  |   |   | | | Video Lecture |
  |   | 8.4 Analyzing Data Reuse in Tiled Convolution |
  |   |   |  | pdf | Lecture-8-4-convolution-reuse.pdf |
  |   |   |  | pptx | Lecture-8-4-convolution-reuse.pptx |
  |   |   | | | Video Lecture |
  | Labs: |
  |   |   |  | | Convolution: Module[8]-Convolution.pdf |
  |   |   |  | | Stencil: Module[8]-Stencil.pdf |
  | Quiz: |
  |   |   |  | | Module 8 Quiz.pdf |
  | Book Chapters: |
  |   |   |  | | Chapter 7 - Parallel Patterns: Convolution: 3rd-Edition-Chapter07-convolution.pdf |
|
Module 9: Parallel Computation Patterns (Reduction) |
In this module we introduce the parallel reduction pattern. |
  | Lectures: |
  |   | 9.1 Parallel Reduction |
  |   |   |  | pdf | Lecture-9-1-reduction.pdf |
  |   |   |  | pptx | Lecture-9-1-reduction.pptx |
  |   |   | | | Video Lecture |
  |   | 9.2 A Basic Reduction Kernel |
  |   |   |  | pdf | Lecture-9-2-reduction-kernel.pdf |
  |   |   |  | pptx | Lecture-9-2-reduction-kernel.pptx |
  |   |   | | | Video Lecture |
  |   | 9.3 A Better Reduction Kernel |
  |   |   |  | pdf | Lecture-9-3-better-reduction-kernel.pdf |
  |   |   |  | pptx | Lecture-9-3-better-reduction-kernel.pptx |
  |   |   | | | Video Lecture |
  | Labs: |
  |   |   |  | | Reduction: Module[9]-Reduction.pdf |
  |   |   |  | | Thrust Reduction: Module[9]-ThrustReduction.pdf |
  | Quiz: |
  |   |   |  | | Module 9 Quiz.pdf |
  | Book Chapters: |
  |   |   |  | | Chapter 5 - Performance Considerations: 3rd-Edition-Chapter05-performance-considerations.pdf |
|
Module 10: Parallel Computation Patterns (Scan) |
In this module we introduce the parallel scan (prefix sum) pattern. |
  | Lectures: |
  |   | 10.1 Prefix Sum |
  |   |   |  | pdf | Lecture-10-1-scan-parallel-prefix-sum.pdf |
  |   |   |  | pptx | Lecture-10-1-scan-parallel-prefix-sum.pptx |
  |   |   | | | Video Lecture |
  |   | 10.2 A Work-Inefficient Scan Kernel |
  |   |   |  | pdf | Lecture-10-2-work-inefficient-scan-kernel.pdf |
  |   |   |  | pptx | Lecture-10-2-work-inefficient-scan-kernel.pptx |
  |   |   | | | Video Lecture |
  |   | 10.3 A Work-Efficient Parallel Scan Kernel |
  |   |   |  | pdf | Lecture-10-3-work-efficient-scan-kernel.pdf |
  |   |   |  | pptx | Lecture-10-3-work-efficient-scan-kernel.pptx |
  |   |   | | | Video Lecture |
  |   | 10.4 More on Parallel Scan |
  |   |   |  | pdf | Lecture-10-4-more-on-parallel-scan.pdf |
  |   |   |  | pptx | Lecture-10-4-more-on-parallel-scan.pptx |
  |   |   | | | Video Lecture |
  | Labs: |
  |   |   |  | | List Scan: Module[10]-ListScan.pdf |
  |   |   |  | | Thrust List Scan: Module[10]-ThrustListScan.pdf |
  | Quiz: |
  |   |   |  | | Module 10 Quiz.pdf |
  | Book Chapters: |
  |   |   |  | | Chapter 8 - Parallel Patterns: Prefix Sum: 3rd-Edition-Chapter08-prefix-sum.pdf |
|
Module 11: Breadth-First Search (BFS) Queue |
In this module we cover the breadth-first search (BFS) queue. |
  | Labs: |
  |   |   |  | | Breadth-First Search Queue: Module[11]-BfsQueue.pdf |
|
Module 12: Floating-Point Considerations |
In this module we introduce the fundamentals of floating-point representation. |
  | Lectures: |
  |   | 12.1 Floating-Point Precision and Accuracy |
  |   |   |  | pdf | Lecture-12-1-floating-point-basics.pdf |
  |   |   |  | pptx | Lecture-12-1-floating-point-basics.pptx |
  |   |   | | | Video Lecture |
  |   | 12.2 Numerical Stability |
  |   |   |  | pdf | Lecture-12-2-numerical-stability.pdf |
  |   |   |  | pptx | Lecture-12-2-numerical-stability.pptx |
  |   |   | | | Video Lecture |
  | Book Chapters: |
  |   |   |  | | Chapter 6 - Numerical Considerations: 3rd-Edition-Chapter06-numerical-considerations.pdf |
|
Module 13: GPU as Part of the PC Architecture |
In this module we introduce how GPUs fit in the PC architecture. |
  | Lectures: |
  |   | 13.1 GPU as Part of the PC Architecture |
  |   |   |  | pdf | Lecture-13-GPU-in-PC-Architecture.pdf |
  |   |   |  | pptx | Lecture-13-GPU-in-PC-Architecture.pptx |
  |   |   | | | Video Lecture |
  | Book Chapters: |
  |   |   |  | | Chapter 18 - Programming a Heterogeneous Computing Cluster: 3rd-Edition-Chapter18-heterogeneous-cluster.pdf |
|
Module 14: Efficient Host-Device Data Transfer |
In this module we discuss important concepts involved in copying (transferring) data between host and device. |
  | Lectures: |
  |   | 14.1 Pinned Host Memory |
  |   |   |  | pdf | Lecture-14-1-Data-Transfer.pdf |
  |   |   |  | pptx | Lecture-14-1-Data-Transfer.pptx |
  |   |   | | | Video Lecture |
  |   | 14.2 Task Parallelism in CUDA |
  |   |   |  | pdf | Lecture-14-2-CUDA-Streams.pdf |
  |   |   |  | pptx | Lecture-14-2-CUDA-Streams.pptx |
  |   |   | | | Video Lecture |
  |   | 14.3 Overlapping Data Transfer with Computation |
  |   |   |  | pdf | Lecture-14-3-Overlap-Transfer.pdf |
  |   |   |  | pptx | Lecture-14-3-Overlap-Transfer.pptx |
  |   |   | | | Video Lecture |
  |   | 14.4 CUDA Unified Memory |
  |   |   |  | pdf | Lecture-14-4-cuda-unified-memory.pdf |
  |   |   |  | pptx | Lecture-14-4-cuda-unified-memory.pptx |
  |   |   | | | Video Lecture |
  | Labs: |
  |   |   |  | | NVIDIA DLI Online Course: Fundamentals of Accelerated Computing with CUDA C/C++, Section 3: Asynchronous Streaming, and Visual Profiling for Accelerated Applications with CUDA C/C++ |
  |   |   |  | | Vector Addition Using CUDA Streams: Module[14]-VectorAdd_Stream.pdf |
  |   |   |  | | Vector Addition Using Pinned Memory: Module[14]-PinnedMemoryStreamsVectorAdd.pdf |
  |   |   |  | | CUDA Unified Memory Matrix Multiplication: Module[14]-UMMatrixMultiplication.pdf |
  | Quiz: |
  |   |   |  | | Module 14 Quiz.pdf |
  | Book Chapters: |
  |   |   |  | | Chapter 18 - Programming a Heterogeneous Computing Cluster: 3rd-Edition-Chapter18-heterogeneous-cluster.pdf |
  |   |   |  | | Chapter 20 - More on CUDA and Graphics Processing Unit Computing: 3rd-Edition-Chapter20-more-cuda-gpu-computing.pdf |
|
Module 15: Application Case Study: Advanced MRI Reconstruction |
In this module we introduce the MRI Reconstruction case study. |
  | Lectures: |
  |   | 15.1 Advanced MRI Reconstruction |
  |   |   |  | pdf | Lecture-15-1-MRI-reconstruction.pdf |
  |   |   |  | pptx | Lecture-15-1-MRI-reconstruction.pptx |
  |   |   | | | Video Lecture |
  |   | 15.2 Kernel Optimizations |
  |   |   |  | pdf | Lecture-15-2-MRI-kernel-optimization.pdf |
  |   |   |  | pptx | Lecture-15-2-MRI-kernel-optimization.pptx |
  |   |   | | | Video Lecture |
  | Book Chapters: |
  |   |   |  | | Chapter 14 - Application Case Study - Non-Cartesian Magnetic Resonance Imaging: 3rd-Edition-Chapter14-case-study-MRI.pdf |
|
Module 16: Application Case Study: Electrostatic Potential Calculation |
In this module we introduce the Electrostatic Potential Calculation case study. |
  | Lectures: |
  |   | 16.1 Electrostatic Potential Calculation - Part 1 |
  |   |   |  | pdf | Lecture-16-1-VMD-case-study-Part1.pdf |
  |   |   |  | pptx | Lecture-16-1-VMD-case-study-Part1.pptx |
  |   |   | | | Video Lecture |
  |   | 16.2 Electrostatic Potential Calculation - Part 2 |
  |   |   |  | pdf | Lecture-16-2-VMD-case-study-Part2.pdf |
  |   |   |  | pptx | Lecture-16-2-VMD-case-study-Part2.pptx |
  |   |   | | | Video Lecture |
|
Module 17: Computational Thinking for Parallel Programming |
In this module we provide a framework for thinking about the problems of parallel programming. |
  | Lectures: |
  |   | 17.1 Introduction to Computational Thinking |
  |   |   |  | pdf | Lecture-17-1-Computational-Thinking.pdf |
  |   |   |  | pptx | Lecture-17-1-Computational-Thinking.pptx |
  | Book Chapters: |
  |   |   |  | | Chapter 17 - Parallel Programming and Computational Thinking: 3rd-Edition-Chapter17-computational-thinking.pdf |
|
Module 18: Related Programming Models: MPI |
In this module we introduce the MPI programming model. |
  | Lectures: |
  |   | 18.1 Introduction to Heterogeneous Supercomputing and MPI |
  |   |   |  | pdf | Lecture-18-MPI-CUDA-intro.pdf |
  |   |   |  | pptx | Lecture-18-MPI-CUDA-intro.pptx |
  | Book Chapters: |
  |   |   |  | | Chapter 18 - Programming a Heterogeneous Computing Cluster: 3rd-Edition-Chapter18-heterogeneous-cluster.pdf |
|
Module 19: CUDA Python using Numba |
In this module we introduce CUDA Python using Numba. |
  | Labs: |
  |   |   |  | | NVIDIA DLI Online Course: Fundamentals of Accelerated Computing with CUDA Python |
|
Module 20: Related Programming Models: OpenCL |
In this module we introduce the OpenCL programming model. |
  | Lectures: |
  |   | 20.1 OpenCL Data Parallelism Model |
  |   |   |  | pdf | Lecture-20-1-opencl-parallelism.pdf |
  |   |   |  | pptx | Lecture-20-1-opencl-parallelism.pptx |
  |   | 20.2 OpenCL Device Architecture |
  |   |   |  | pdf | Lecture-20-2-opencl-architecture.pdf |
  |   |   |  | pptx | Lecture-20-2-opencl-architecture.pptx |
  |   | 20.3 OpenCL Host Code |
  |   |   |  | pdf | Lecture-20-3-opencl-host-code.pdf |
  |   |   |  | pptx | Lecture-20-3-opencl-host-code.pptx |
  | Labs: |
  |   |   |  | | OpenCL Vector Addition: Module[20]-OpenCLVectorAddition.pdf |
  | Quiz: |
  |   |   |  | | Module 20 Quiz.pdf |
  | Book Chapters: |
  |   |   |  | | Appendix A - An Introduction to OpenCL: 3rd-Edition-AppendixA-intro-to-OpenCL.pdf |
|
Module 21: Related Programming Models: OpenACC |
In this module we introduce the OpenACC programming model. |
  | Lectures: |
  |   | 21.1 Introduction to OpenACC |
  |   |   |  | pdf | Lecture-21-1-openACC-intro.pdf |
  |   |   |  | pptx | Lecture-21-1-openACC-intro.pptx |
  |   |   | | | Video Lecture |
  |   | 21.2 OpenACC Subtleties |
  |   |   |  | pdf | Lecture-21-2-openACC-subtleties.pdf |
  |   |   |  | pptx | Lecture-21-2-openACC-subtleties.pptx |
  |   |   | | | Video Lecture |
  | Labs: |
  |   |   |  | | NVIDIA DLI Online Course: Fundamentals of Accelerated Computing with OpenACC |
  |   |   |  | | OpenACC CUDA Vector Add: Module[21]-OpenACCVectorAdd.pdf |
  | Quiz: |
  |   |   |  | | Module 21 Quiz.pdf |
  | Book Chapters: |
  |   |   |  | | Chapter 19 - Parallel Programming with OpenACC: 3rd-Edition-Chapter19-programming-with-OpenACC.pdf |
|
Module 22: Related Programming Models: OpenGL |
In this module we introduce the OpenGL programming model. |
(Module scheduled for a future release of the teaching kit.) |
|
Module 23: Dynamic Parallelism |
In this module we introduce dynamic parallelism. |
  | Lectures: |
  |   | 23.1 Dynamic Parallelism |
  |   |   |  | pdf | Lecture-23-Dynamic-parallelism.pdf |
  |   |   |  | pptx | Lecture-23-Dynamic-parallelism.pptx |
  |   |   | | | Video Lecture |
  | Labs: |
  |   |   |  | | Dynamic Parallelism: Module[23]-DynamicParallelism.pdf |
  | Book Chapters: |
  |   |   |  | | Chapter 13 - CUDA Dynamic Parallelism: 3rd-Edition-Chapter13-cuda-dynamic-parallelism |
|
Module 24: Multi-GPU |
In this module we discuss programming with multiple GPUs. |
  | Lectures: |
  |   | 24.1 OpenMP |
  |   |   |  | pdf | Lecture-24-1-openmp.pdf |
  |   |   |  | pptx | Lecture-24-1-openmp.pptx |
  |   | 24.2 Multi-GPU Introduction I |
  |   |   |  | pdf | Lecture-24-2-multi-gpu-introduction-i.pdf |
  |   |   |  | pptx | Lecture-24-2-multi-gpu-introduction-i.pptx |
  |   | 24.3 Multi-GPU Introduction II |
  |   |   |  | pdf | Lecture-24-3-multi-gpu-introduction-ii.pdf |
  |   |   |  | pptx | Lecture-24-3-multi-gpu-introduction-ii.pptx |
  |   | 24.4 OpenMP and Cooperative Groups |
  |   |   |  | pdf | Lecture-24-4-openmp-and-cooperative-groups.pdf |
  |   |   |  | pptx | Lecture-24-4-openmp-and-cooperative-groups.pptx |
  |   | 24.5 Multi-GPU Heat Equation |
  |   |   |  | pdf | Lecture-24-5-multi-gpu-heat-equation.pdf |
  |   |   |  | pptx | Lecture-24-5-multi-gpu-heat-equation.pptx |
  | Labs: |
  |   |   |  | | Multi-GPU Heat Equation: Module[24]-HeatEquation.pdf |
  | Quiz: |
  |   |   |  | | Module 24 Quiz.pdf |
|
Module 25: Using CUDA Libraries |
In this module we introduce the effective use of CUDA libraries. |
  | Lectures: |
  |   | 25.1 cuBLAS |
  |   |   |  | pdf | Lecture-25-1-cublas.pdf |
  |   |   |  | pptx | Lecture-25-1-cublas.pptx |
  |   | 25.2 cuSOLVER |
  |   |   |  | pdf | Lecture-25-2-cusolver.pdf |
  |   |   |  | pptx | Lecture-25-2-cusolver.pptx |
  |   | 25.3 cuFFT |
  |   |   |  | pdf | Lecture-25-3-cufft.pdf |
  |   |   |  | pptx | Lecture-25-3-cufft.pptx |
  |   | 25.4 Thrust |
  |   |   |  | pdf | Lecture-25-4-thrust.pdf |
  |   |   |  | pptx | Lecture-25-4-thrust.pptx |
  | Labs: |
  |   |   |  | | Heat Equation with NVIDIA libraries: Module[25]-HeatEquationLibs.pdf |
  | Quiz: |
  |   |   |  | | Module 25 Quiz.pdf |
  | Book Chapters: |
  |   |   |  | | Appendix B - Thrust: A Productivity-Oriented Library for CUDA: 3rd-Edition-AppendixB-thrust |
|
Module 26: Advanced Thrust |
In this module we discuss advanced Thrust topics. |
(Module scheduled for a future release of the teaching kit.) |
|