#### INTELLECTUAL PROPERTY RIGHTS NOTICE:

- The User may only download, make and retain a copy of the materials for his/her use for non-commercial and research purposes.
- The User may not commercially use the material, unless has been granted prior written consent by the Licensor to do so; and cannot remove, obscure or modify copyright notices, text acknowledging or other means of identification or disclaimers as they appear.
- For further details, please contact BSC-CNS patc@bsc.es

(#)

PATC training, Barcelona, May 2012



#### **PRACE Training Course: Introduction to CUDA Programming**

Area: Core HPC Curriculum

Level: BEGINNERS: for trainees from different background and very little knowledge

#### **Prerequisites:**

Basic knowledge of C/C++ programming Attendees will need to bring their own laptops with a SSH client

#### Convener: Isaac Gelado

#### **Objectives:**

The aim of this course is to provide students with knowledge and hands-on experience in developing applications software for processors with massively parallel computing resources. In general, we refer to a processor as massively parallel if it has the ability to complete more than 64 arithmetic operations per clock cycle. Many commercial offerings from NVIDIA, AMD, and Intel already offer such levels of concurrency. Effectively programming these processors will require in-depth knowledge about parallel programming principles, as well as the parallelism models, communication models, and resource limitations of these processors. The target audiences of the course are students who want to develop exciting applications for these processors, as well as those who want to develop programming tools and future implementations for these processors.

#### Learning Outcomes:

The students who finish this course will learn how to program massively parallel processors and achieve high performance, functionality, maintainability, and scalability across future generations.

The students who finish this course will acquire technical knowledge required to achieve the above goals by learning principles and patterns of parallel algorithms, processor architecture features and constraints, and programming API, tools and techniques.



#### Timetable

#### Day 1 / Session 1 / 9am - 1 pm: (3h lectures with 5 min breaks on the hour)

- 1. Introduction to CUDA
- 2. CUDA Threading Model (I)
- 3. CUDA Threading Model (II)

Session 2 / 2 pm- 6 pm: 3h practical session – lab exercises

#### Day 2 / Session 3 / 9am- 1 pm: (3h practical session)

- 1. CUDA Memory Model
- 2. Matrix Multiplication Shared Memory
- 3. 2D Convolution Constant Memory

Session 4 / 2 pm- 6 pm: 3h practical session – lab exercises

#### Day 3 / Session 5 / 9am- 1 pm: (3h practical session)

- 1. CUDA Memory Model
- 2. Matrix Multiplication Shared Memory
- 3. 2D Convolution Constant Memory

Session 6 / 2 pm- 6 pm: 3h practical session – lab exercises

#### Day 4 / Session 7 / 9am- 1 pm: (3h practical session)

- 1. Parallel Reductions
- 2. Memory Bandwidth Considerations
- 3. Prefix Scan

Session 8 / 2 pm- 6 pm: 3h practical session – lab exercises

END of COURSE



#### www.bsc.es



**Barcelona Supercomputing Center** Centro Nacional de Supercomputación

# **Introduction to CUDA Programming**

Lecture 1: Introduction

#### Disclaimer

- ( All the material of this Seminar is based on the ECE408 course imparted at the University of Illinois
- ( All the credit for this material goes to:
  - Prof. Wen-mei W. Hwu
    - Full Professor at the ECE and CS Departments in the University of Illinois
    - Director of the Blue-Waters Supercomputer
  - David Kirk
    - NVIDIA Fellow
    - Professor of ECE







## **Course Goals**

- ( Learn how to program massively parallel processors and achieve
  - high performance
  - functionality and maintainability
  - scalability across future generations
- ( Acquire technical knowledge required to achieve the above goals
  - principles and patterns of parallel algorithms
  - processor architecture features and constraints
  - programming API, tools and techniques



#### Text/Notes

- D. Kirk and W. Hwu, "Programming Massively Parallel Processors – A Hands-on Approach," Morgan Kaufman Publisher, 2010, ISBN 978-0123814722
- 2. NVIDIA, *NVidia CUDA C Programming Guide*, version 4.0, NVidia, 2011 (reference book)
- 3. Lecture notes and recordings will be posted at the class web site





**Barcelona Supercomputing Center** Centro Nacional de Supercomputación

# THE ADVENT OF GPUS

#### Performance Advantage of GPUs

- ( An enlarging peak performance advantage:
  - Calculation: 1 TFLOPS vs. 100 GFLOPS
  - Memory Bandwidth: 100-150 GB/s vs. 32-64 GB/s



#### Courtesy: John Owens



## GPU computing is catching on



#### ( 280 submissions to GPU Computing Gems

110 articles included in two volumes



#### CPUs vs. GPUs

(CPUs and GPUs have fundamentally different design philosophies





DRAM



#### **CPUs: Latency Oriented Design**

#### ( Large caches

Convert long latency memory accesses to short latency cache accesses

#### ( Sophisticated control

- Branch prediction for reduced branch latency
- Data forwarding for reduced data latency
- ( Powerful ALU
  - Reduced operation latency





## **GPUs: Throughput Oriented Design**

## ( Small caches

To boost memory throughput

## ( Simple control

- No branch prediction
- No data forwarding
- ( Energy efficient ALUs
  - Many, long latency but heavily pipelined for high throughput

## ( Require massive number of threads to tolerate latencies





## **Stretching Traditional Architectures**

- (( Traditional parallel architectures cover some super-applications
  - DSP, GPU, network apps, Scientific
- (( The game is to grow mainstream architectures "out" or domain-specific architectures "in"
  - CUDA is latter







# (CPUs for sequential parts where latency matters

 CPUs can be 10+X faster than GPUs for sequential code (COPUs for parallel parts where throughput wins

> GPUs can be 10+X faster than CPUs for parallel code



## A Common GPU Usage Pattern

- ( A desirable approach considered impractical
  - Due to excessive computational requirement
  - But demonstrated to achieve domain benefit
  - Convolution filtering (e.g. bilateral Gaussian filters), De Novo gene assembly, etc.
- ( Use GPUs to accelerate the most time-consuming aspects of the approach
  - Kernels in CUDA
  - Refactor host code to better support kernels
- ( Rethink the domain problem



## **CUDA** /OpenCL – Execution Model

- ( Integrated host+device app C program
  - Serial or modestly parallel parts in host C code
  - Highly parallel parts in device SPMD kernel C code







**Barcelona Supercomputing Center** Centro Nacional de Supercomputación

# THE GPU MODEL

# From Natural Language to Electrons

| -<br>Compiler → - | Natural Language (e.g, English) |  |
|-------------------|---------------------------------|--|
|                   | Algorithm                       |  |
|                   | High-Level Language (C/C++)     |  |
|                   | Instruction Set Architecture    |  |
|                   | Microarchitecture               |  |
|                   | Circuits                        |  |
|                   | Electrons                       |  |

©Yale Patt and Sanjay Patel, From bits and bytes to gates and beyond



## The ISA

( An Instruction Set Architecture (ISA) is a contract between the hardware and the software.

( As the name suggests, it is a set of instructions that the architecture (hardware) can execute.



#### A program at the ISA level

( A program is a set of instructions stored in memory that can be read, interpreted, and executed by the hardware.

( Program instructions operate on data stored in memory or provided by Input/Output (I/O) device.



#### The Von-Neumann Model





## Arrays of Parallel Threads

( A CUDA kernel is executed by a grid (array) of threads

- All threads in a grid run the same kernel code (SPMD)
- Each thread has an index that it uses to compute memory addresses and make control decisions





#### **Thread Blocks: Scalable Cooperation**

- ( Divide thread array into multiple blocks
  - Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
  - Threads in different blocks cannot cooperate





#### blockIdx and threadIdx

- Each thread uses indices to decide what data to work on
  - blockIdx: 1D, 2D, or 3D (CUDA 4.0)
  - threadIdx: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data
  - Image processing
  - Solving PDEs on volumes
- Host **Device** Grid 1 Kernel Block Block (0, 0)(1, 0)Block Block (0, 1)(1, 1)Grid 2 Kernel 2 Block (1, 1) (1.0.1)(2,0,1)Thread Thread Thread Thread (0.0.0)(1,0,0)(2,0,0)(3,0,0)Thread Thread Thread T hread (1,1,0)(3,1,0)(0,1,0)(2,1,0)



. . .





#### Vector Addition – Traditional C Code

```
// Compute vector sum C = A+B
void vecAdd(float* A, float* B, float* C, int n)
{
  for (i = 0, i < n, i++)
    C[i] = A[i] + B[i];
}
int main()
{
    // Memory allocation for A h, B h, and C h
     // I/O to read A h and B h, N elements
     ...
    vecAdd(A_h, B_h, C_h, N);
```



#### Heterogeneous Computing vecAdd Host Code



# Partial Overview of CUDA Memories

- ( Device code can:
  - R/W per-thread registers
  - R/W per-grid global memory
- ( Host code can
  - Transfer data to/from per grid global memory



We will cover more later.



## **CUDA Device Memory Management API functions**

## ( cudaMalloc()

- Allocates object in the device global memory
- Two parameters
  - Address of a pointer to the allocated object
  - Size of of allocated object in terms of bytes
- (( cudaFree()
  - Frees object from device global memory
    - Pointer to freed object





## Host-Device Data Transfer API functions

## (( cudaMemcpy()

- Memory data transfer
- Requires four parameters
  - Pointer to destination
  - Pointer to source
  - Number of bytes copied
  - Type of transfer
- Transfer to device is asynchronous





#### Heterogeneous Computing vecAdd Host Code

```
void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float* A_d, B_d, C_d;
```

```
1. // Transfer A and B to device memory
  cudaMalloc((void **) &A_d, size);
  cudaMemcpy(A_d, A, size);
  cudaMalloc((void **) &B_d, size);
  cudaMemcpy (B_d, B, size);
```

```
// Allocate device memory for
cudaMalloc((void **) &C_d, size);
```

2. // Kernel invocation code - to be shown later

```
3. // Transfer C from device to host
cudaMemcpy (C, C_d, size);
    // Free device memory for A, B, C
cudaFree(A_d); cudaFree(B_d); cudaFree (C_d);
```



#### **Example: Vector Addition Kernel**



```
int vectAdd(float* A, float* B, float* C, int n)
```

```
// A_d, B_d, C_d allocations and copies omitted
// Run ceil(n/256) blocks of 256 threads each
vecAddKernel<<<ceil(n/256), 256>>>(A_d, B_d, C_d, n);
```



{

#### **Example: Vector Addition Kernel**

```
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
global
void vecAddkernel(float* A_d, float* B_d, float* C_d, int n)
ſ
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if(i < n) C d[i] = A d[i] + B d[i];
}
                                                         Host Code
int vecAdd(float* A, float* B, float* C, int n)
{
 // A_d, B_d, C_d allocations and copies omitted
 // Run ceil(n/256) blocks of 256 threads each
  vecAddKernnel<<<ceil(n/256),256>>>(A d, B d, C d, n);
}
```





```
int vecAdd(float* A, float* B, float* C, int n)
{
   // A_d, B_d, C_d allocations and copies omitted
   // Run ceil(n/256) blocks of 256 threads each
   dim3 DimGrid(ceil(n/256), 1, 1);
   dim3 DimBlock(256, 1, 1);
```

```
vecAddKernnel<<<<DimGrid,DimBlock>>>(A_d, B_d, C_d, n);
}
```

# ( Any call to a kernel function is asynchronous from CUDA 1.0 on, explicit synch needed for blocking



# More on CUDA Function Declarations

|                                      | Executed on the: | Only callable<br>from the: |
|--------------------------------------|------------------|----------------------------|
| <pre>device float DeviceFunc()</pre> | device           | device                     |
| global void KernelFunc()             | device           | host                       |
| <u>host</u> float HostFunc()         | host             | host                       |

- \_\_global\_\_ defines a kernel function
  - Each "\_\_\_" consists of two underscore characters
  - A kernel function must return void
- <u>device</u> and <u>host</u> can be used together



# Kernel execution in a nutshell





Supercomputing Center Centro Nacional de Supercomputación

#### www.bsc.es



**Barcelona Supercomputing Center** Centro Nacional de Supercomputación

# Introduction to CUDA Programming Lecture 2: CUDA Parallel Execution Model

# **Block IDs and Thread IDs**

- ( Each thread uses IDs to decide what data to work on
  - Block ID: 1D, 2D, or 3D
  - Thread ID: 1D, 2D, or 3D
- ( Simplifies memory addressing when processing multidimensional data
  - Image processing
  - Solving PDEs on volumes





. . .

# A Simple Example: Matrix Multiplication

- ( A simple illustration of the basic features of memory and thread management in CUDA programs
  - Thread index usage
  - Memory layout
  - Register usage
  - Assume square matrix for simplicity
  - Leave shared memory usage until later



# **Square Matrix-Matrix Multiplication**

- ( P = M \* N of size WIDTH x WIDTH
  - Each thread calculates one element of P
  - Each row of M is loaded WIDTH times from global memory
  - Each column of N is loaded WIDTH times from global memory





# Memory Layout of a Matrix in C

| M <sub>0,0</sub> | M <sub>1,0</sub> | M <sub>2,0</sub> | М <sub>3,0</sub> |
|------------------|------------------|------------------|------------------|
| M <sub>0,1</sub> | M <sub>1,1</sub> | M <sub>2,1</sub> | M <sub>3,1</sub> |
| M <sub>0,2</sub> | M <sub>1,2</sub> | M <sub>2,2</sub> | M <sub>3,2</sub> |
| M <sub>0,3</sub> | M <sub>1,3</sub> | M <sub>2,3</sub> | M <sub>3,3</sub> |







# **Square Matrix-Matrix Multiplication**

```
void MatrixMul(float* M, float* N, float* P,
                 int Width)
                                                                            k
{
  for (int i = 0; i < Width; ++i)
     for (int j = 0; j < Width; ++j) {</pre>
        double sum = 0;
        for (int k = 0; k < Width; ++k) {
          double a = M[i * Width + k];
          double b = N[k * Width + j];
          sum += a * b;
      }
      P[i * Width + j] = sum;
   }
 }
                                     K
```



# Kernel Function - A Small Example

( Have each 2D thread block to compute a (TILE\_WIDTH)<sup>2</sup> sub-matrix (tile) of the result matrix
 – Each has (TILE\_WIDTH)<sup>2</sup> threads

( Generate a 2D Grid of (WIDTH/TILE\_WIDTH)<sup>2</sup> blocks





# A Slightly Bigger Example





| В                | lock(            | 0,0)             |                  | Blo              | ock(1            | ,0)              |                  |                                                             |
|------------------|------------------|------------------|------------------|------------------|------------------|------------------|------------------|-------------------------------------------------------------|
|                  |                  |                  |                  |                  | \<br>            | 4                |                  | 1                                                           |
| P <sub>0,0</sub> | P <sub>1,0</sub> | P <sub>2,0</sub> | $P_{3,0}$        | P <sub>4,0</sub> | P <sub>5,0</sub> | P <sub>6,0</sub> | P <sub>7,0</sub> |                                                             |
| P <sub>0,1</sub> | P <sub>1,1</sub> | P <sub>2,1</sub> | P <sub>3,1</sub> | P <sub>4,1</sub> | P <sub>5,1</sub> | P <sub>6,1</sub> | P <sub>7,1</sub> |                                                             |
| P <sub>0,2</sub> | P <sub>1,2</sub> | P <sub>2,2</sub> | P <sub>3,2</sub> | P <sub>4,2</sub> | P <sub>5,2</sub> | P <sub>6,2</sub> | P <sub>7,2</sub> | WIDTH = 8; TILE_WIDTH = 4<br>Each block has 4*4 =16 threads |
| P <sub>0,3</sub> | P <sub>1,3</sub> | P <sub>2,3</sub> | P <sub>3,3</sub> | 4 <sub>4,3</sub> | P <sub>5,3</sub> | P <sub>6,3</sub> | P <sub>7,3</sub> |                                                             |
| P <sub>0,4</sub> | P <sub>1,4</sub> | P <sub>2,4</sub> | P <sub>3,4</sub> | P <sub>4,4</sub> | P <sub>5,4</sub> | P <sub>6,4</sub> | P <sub>7,4</sub> | WIDTH/TILE_WIDTH = 2<br>Use 2* 2 = 4 blocks                 |
| P <sub>0,5</sub> | P <sub>1,5</sub> | P <sub>2,5</sub> | P <sub>3,5</sub> | P <sub>4,5</sub> | P <sub>5,5</sub> | P <sub>6,5</sub> | P <sub>7,5</sub> |                                                             |
| P <sub>0,6</sub> | P <sub>1,6</sub> | P <sub>2,6</sub> | P <sub>3,6</sub> | P <sub>4,6</sub> | P <sub>5,6</sub> | P <sub>6,6</sub> | P <sub>7,6</sub> |                                                             |
| P <sub>0,7</sub> | P <sub>1,7</sub> | P <sub>2,7</sub> | P <sub>3,7</sub> | P <sub>4,7</sub> | P <sub>5,7</sub> | P <sub>6,7</sub> | P <sub>7,7</sub> |                                                             |



### Kernel Invocation (Host-side Code)

/\* Setup the execution configuration \*/
/\* TILE\_WIDTH is a #define constant \*/
dim3 dimGrid(Width/TILE\_WIDTH Width/TILE\_WIDTH, 1);
dim3 dimBlock(TILE\_WIDTH, TILE\_WIDTH, 1);

/\* Launch the device computation threads!\*/
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);



# **Kernel Function**

/\* Matrix multiplication kernel - per thread code \*/
\_\_global\_\_ void MatrixMulKernel(float\* d\_M, float\* d\_N, float\* d\_P,
 int Width)

```
{
```

/\* Pvalue is used to store the element of the matrix
that is computed by the thread \*/
float Pvalue = 0;



# Block (0,0) in a TILE\_WIDTH = 2 Configuration

blockIdx.y



13

N<sub>2,3</sub> N<sub>3,3</sub>

Row = 0 Row = 1

 $Col = Q * (TILE_WIDTH) + threadIdx.x$ 

Row =  $0^{*}$  (TILE\_WIDTH) + threadIdx.y

| M <sub>e,e</sub> | M <sub>1,0</sub> | M <sub>2,0</sub> | M <sub>ə,ə</sub> | <br>P <sub>o,c</sub> | <b>P</b> <sub>10</sub> | P <sub>2,0</sub> | P <sub>3,0</sub> |
|------------------|------------------|------------------|------------------|----------------------|------------------------|------------------|------------------|
| M <sub>e,1</sub> | M <sub>1,1</sub> | M <sub>2,1</sub> | M <sub>0,1</sub> | <br>₽°,†             | <b>₽</b> 1,1           | P <sub>2,1</sub> | P <sub>3,1</sub> |
| M <sub>0,2</sub> | M <sub>1,2</sub> | M <sub>2,2</sub> | M <sub>3,2</sub> | P <sub>0,2</sub>     | P <sub>1,2</sub>       | P <sub>2,2</sub> | P <sub>3,2</sub> |
| M <sub>0,3</sub> | M <sub>1,3</sub> | M <sub>2,3</sub> | M <sub>3,3</sub> | P <sub>0,3</sub>     | P <sub>1,3</sub>       | P <sub>2,3</sub> | P <sub>3,3</sub> |

Ν



blockIdx.x

#### Work for Block (1,0)

blockIdx.x

Col = 3 Col = 2





Col = 1 \* (TILE\_WIDTH) + threadIdx.x Row = 0 \* (TILE\_WIDTH) + threadIdx.y

blockIdx.y





#### Work for Block (0,1)





#### Work for Block (1,1)

Col Col Ш Col = 1 \* (TILE\_WIDTH) + threadIdx.x Row = 1 \* (TILE\_WIDTH) + threadIdx.y ω N N<sub>0,0</sub> N<sub>1,0</sub> N<sub>20</sub> N<sub>30</sub> N<sub>1,1</sub> N<sub>0,1</sub> blockIdx.x blockIdx.y N<sub>0,2</sub> |  $N_{1,2} | N_{22}$ IN<sub>3</sub> N<sub>0,3</sub> | N<sub>1,3</sub> P<sub>0,0</sub> P<sub>1,0</sub>  $M_{0,0} | M_{1,0} | M_{2,0} | M_{3,0}$ H 3 P<sub>0,1</sub> P<sub>1,1</sub>  $M_{0,1} | M_{1,1} | M_{2,1} | M_{3,1}$ Row = 2Mag Mag Ma P Р M Row = 3NЛ Р



# A Simple Matrix Multiplication Kernel

```
_global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int
Width)
```

```
/* Calculate the row index of the Pd element and M */
int Row = blockIdx.y*blockDim.y + threadIdx.y;
/* Calculate the column idenx of Pd and N */
int Col = blockIdx.x*blockDim.x + threadIdx.x;
```

```
float Pvalue = 0;
/* Each thread computes one element of the block sub- matrix */
for (int k = 0; k < Width; ++k)
    Pvalue += d_M[Row*Width+k] * d_N[k*Width+Col];</pre>
```

#### d\_P[Row\*Width+Col] = Pvalue;



{

# **CUDA Thread Block**

- ( All threads in a block execute the same kernel program (SPMD)
- ( Programmer declares block:
  - Block size 1 to 1024 concurrent threads
  - Block shape 1D, 2D, or 3D
  - Block dimensions in threads
- ( Threads have thread index numbers within block
  - Kernel code uses thread index and block index to select work and address shared data
- ( Threads in the same block share data and synchronize while doing their share of the work
- ( Threads in different blocks cannot cooperate
  - Each block can execute in any order relative to other blocks!

#### **CUDA Thread Block**



Courtesy: John Nickolls, NVIDIA





**Barcelona Supercomputing Center** Centro Nacional de Supercomputación

# **REVIEW OF PARALLEL EXECUTION**

- (1<sup>st</sup> gen Instructions are executed sequentially in program order, one at a time.
- ( Example:

| Cycle        | 1     | 2      | 3       | 4      | 5     | 6      |
|--------------|-------|--------|---------|--------|-------|--------|
| Instruction1 | Fetch | Decode | Execute | Memory |       |        |
| Instruction2 |       |        |         |        | Fetch | Decode |



- ( 2<sup>nd</sup> gen Instructions are executed sequentially, in program order, in an assembly line fashion. (pipeline)
- ( Example:

| Cycle        | 1     | 2      | 3       | 4       | 5       | 6      |
|--------------|-------|--------|---------|---------|---------|--------|
| Instruction1 | Fetch | Decode | Execute | Memory  |         |        |
| Instruction2 |       | Fetch  | Decode  | Execute | Memory  |        |
| Instruction3 |       |        | Fetch   | Decode  | Execute | Memory |



# History – Instruction Level Parallelism

( 3<sup>rd</sup> gen - Instructions are executed in parallel

(( Example code 1:  

$$c = b + a;$$
  
 $d = c + e;$  Non-parallelizable



#### Instruction Level Parallelism (Cont.)

### ( Two forms of ILP:

 Superscalar: At runtime, fetch, decode, and execute multiple instructions at a time. Execution may be out of order

| Cycle        | 1     | 2      | 3       | 4       | 5      |
|--------------|-------|--------|---------|---------|--------|
| Instruction1 | Fetch | Decode | Execute | Memory  |        |
| Instruction2 | Fetch | Decode | Execute | Memory  |        |
| Instruction3 |       | Fetch  | Decode  | Execute | Memory |
| Instruction4 |       | Fetch  | Decode  | Execute | Memory |

 VLIW: At compile time, pack multiple, independent instructions in one large instruction and process the large instructions as the atomic units.



- ( 4<sup>th</sup> gen Multi-threading: multiple threads are executed in an alternating or simultaneous manner on the same processor/core. (will revisit)
- ( 5<sup>th</sup> gen Multi-Core: Multiple threads are executed simultaneously on multiple processors



#### **Transparent Scalability**

( Hardware is free to assigns blocks to any processor at any time

 A kernel scales across any number of parallel processors





# **Example: Executing Thread Blocks**



( Threads are assigned to Streaming Multiprocessors in block granularity

- Up to 8 blocks to each SM as resource allows
- Fermi SM can take up to 1536 threads (256 (threads/block) \* 6 blocks or 512 (threads/block) \* 3 blocks, etc.)



# The Von-Neumann Model





# **Example: Thread Scheduling**

- (C Each Block is executed as 32-thread Warps
- ( An implementation decision, not part of the CUDA programming model
- Warps are scheduling units in SM
- If 3 blocks are assigned to an SM and each block has 256 threads, how many Warps are there in an SM?
- (C Each Block is divided into 256/32 = 8 Warps
- ( There are 8 \* 3 = 24 Warps







- ( Every instruction needs to be fetched from memory, decoded, then executed.
- ( Instructions come in three flavors: Operate, Data transfer, and Program Control Flow.
- ( An example instruction cycle is the following:
  - Fetch | Decode | Execute | Memory





**Barcelona Supercomputing Center** Centro Nacional de Supercomputación

# **INSTRUCTIONS AND PERFORMANCE**

# (C Example of an operate instruction: ADD R1, R2, R3

# ((Instruction cycle for an operate instruction: Fetch | Decode | Execute | Memory



# (C Examples of data transfer instruction: LDR R1, R2, #2 STR R1, R2, #2

( Instruction cycle for an operate instruction: Fetch | Decode | Execute | Memory



# ( Example of control flow instruction: BRp #-4 if the condition is positive, jump back four instructions

( Instruction cycle for an arithmetic instruction: Fetch | Decode | Execute | Memory



#### How thread blocks are partitioned

- ( Thread blocks are partitioned into warps
  - Thread IDs within a warp are consecutive and increasing
  - Warp 0 starts with Thread ID 0
- ( Partitioning is always the same
  - Thus you can use this knowledge in control flow
  - However, the exact size of warps may change from generation to generation
  - (Covered next)

# ( However, DO NOT rely on any ordering between warps

If there are any dependencies between threads, you must \_\_\_\_\_\_syncthreads() to get correct results (more later).



# **Control Flow Instructions**

- ( Main performance concern with branching is divergence
  - Threads within a single warp take different paths
  - Different execution paths are serialized in current GPUs
- ( A common case: avoid divergence when branch condition is a function of thread ID
  - Example with divergence: if(threadIdx.x > 2) { }
    - This creates two different control paths for threads in a block
    - Branch granularity < warp size; threads 0, 1 and 2 follow different path than the rest of the threads in the first warp
  - Example without divergence: if(threadIdx.x / WARP\_SIZE > 2) { }
    - Also creates two different control paths for threads in a block
    - Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path



# Example: Thread Scheduling (Cont.)

#### ( SM implements zero-overhead warp scheduling

- At any time, 1 or 2 of the warps is executed by SM
- Warps whose next instruction has its operands ready for consumption are eligible for execution
- Eligible Warps are selected for execution on a prioritized scheduling policy
- All threads in a warp execute the same instruction when selected



TB = Thread Block, W = Warp



# **Block Granularity Considerations**

- ( For Matrix Multiplication using multiple blocks, should I use 8X8, 16X16 or 32X32 blocks?
  - For 8X8, we have 64 threads per Block. Since each SM can take up to 1536 threads, there are 24 Blocks. However, each SM can only take up to 8 Blocks, only 512 threads will go into each SM!
  - For 16X16, we have 256 threads per Block. Since each SM can take up to 1536 threads, it can take up to 6 Blocks and achieve full capacity unless other resource considerations overrule.
  - For 32X32, we would have 1024 threads per Block. Only one block can fit into an SM for Fermi. Using only 2/3 of the thread capacity of an SM. Also, this works for CUDA 3.0 and beyond but too large for some early CUDA versions.



# **Application Programming Interface**

- ( The API is an extension to the C programming language( It consists of:
  - Language extensions
    - To target portions of the code for execution on the device
  - A runtime library split into:
    - A common component providing built-in vector types and a subset of the C runtime library in both host and device codes
    - A host component to control and access one or more devices from the host
    - A device component providing device-specific functions



# **Common Runtime Component: Mathematical Functions**

- ( pow, sqrt, cbrt, hypot
- ( exp, exp2, expm1
- (( log, log2, log10, log1p
- (( sin, cos, tan, asin, acos, atan, atan2
- (( sinh, cosh, tanh, asinh, acosh, atanh
- ([ ceil, floor, trunc, round
  - When executed on the host, a given function uses the C runtime implementation if available
  - These functions are only supported for scalar types, not vector types



# **Device Runtime Component: Mathematical Functions**

- (( Some mathematical functions (e.g. sin(x)) have a less accurate, but faster device-only version (e.g. \_\_sin(x))
  - \_\_pow
  - \_\_log, \_\_log2, \_\_log10
  - \_\_exp



#### www.bsc.es



Barcelona Supercomputing Center Centro Nacional de Supercomputación

# Introduction to CUDA Programming Lecture 7: Data Transfer and CUDA Streams

### Objective

( To understand the major factors that dictate performance when using GPU as an compute accelerator for the CPU

- The feeds and speeds of the traditional CPU world
- The feeds and speeds when employing a GPU
- To form a solid knowledge base for performance programming in modern GPU's
- (Knowing yesterday, today, and tomorrow
  - The PC world is becoming flatter
  - CPU and GPU are being Fused together
  - Outsourcing of computation is becoming easier...



#### Allocate/Free Pinned Memory (a.k.a. Page Locked Memory)

#### (**CudaHostAlloc**()

- Three parameters
- Address of pointer to the allocated memory
- Size of the allocated memory in bytes
- Option use cudaHostAllocDefault for now

#### (cudaFreeHost()

- One parameter
- Pointer to the memory to be freed



### **Using Pinned Memory**

- ( Use the allocated memory and its pointer the same way those returned by malloc();
- ( The only difference is that the allocated memory cannot be paged by the OS
- ( The cudaMemCpy function should be about 2X faster with pinned memory



### Serialized Data Transfer and GPU computation

( So far, the way we use cudaMemCpy serializes data transfer and GPU computation





#### **Device** Overlap

#### ( Some CUDA devices support device overlap

 Simultaneously execute a kernel while performing a copy between device and host memory

int Device; cudaDeviceProp prop;

cudaGetDevice(&Device); cudaGetDeviceProperties(&prop, Device);

if (prop.deviceOverlap) ...



### **Overlapped (Pieplined) Timing**

- ( Divide large vectors into segments
- ( Overlap transfer and compute of adjacent segments





# Using CUDA Streams and Asynchronous MemCpy

- (CUDA supports parallel execution of kernels and MemCpy with "Streams"
- ( Each stream is a queue of operations (kernels and MemCpys)
- ( Operations in different streams can go in parallel
  - "Task parallelism"



### **Conceptual View of Streams**





#### A Simple Multi-Stream Host Code

```
cudaStream_t stream0, stream1;
cudaStreamCreate( &stream0);
cudaStreamCreate( &stream1);
float *d_A0, *d_B0, *d_C0; // device memory for stream 0
float *d_A1, *d_B1, *d_C1; // device memory for stream 1
```

```
// cudaMalloc for d_A0, d_B0, d_C0, d_A1, d_B1, d_C1 go here
```

```
for (int i=0; i<n; i+=SegSize*2) {
  cudaMemCpyAsync(d_A0, h_A+i; SegSize*sizeof(float),.., stream0);
  cudaMemCpyAsync(d_B0, h_B+i; SegSize*sizeof(float),.., stream0);
  vecAdd<<<SegSize/256, 256, 0, stream0);
  cudaMemCpyAsync(d_C0, h_C+I; SegSize*sizeof(float),.., stream0);</pre>
```



# A Simple Multi-Stream Host Code (Cont.)



#### A View Closer to Reality



BSC Barcelona Supercomputing Center Centro Nacional de Supercomputación

ón

#### Not quite the overlap want

#### (C.1 blocks A.2 and B.2 in the copy engine queue





# A Better Multi-Stream Host Code (Cont.)

```
for (int i=0; i<n; i+=SegSize*2) {</pre>
  cudaMemCpyAsync(d A0, h A+i; SegSize*sizeof(float),.., stream0);
  cudaMemCpyAsync(d B0, h B+i; SegSize*sizeof(float),.., stream0);
  cudaMemCpyAsync(d A1, h A+i+SegSize;
                                        SegSize*sizeof(float),..., stream1);
  cudaMemCpyAsync(d B1, h B+i+SegSize;
                                        SegSize*sizeof(float),..., stream1);
  vecAdd<<<SegSize/256, 256, 0, stream0)(d A0, d B0, ...);</pre>
  vecAdd<<<SegSize/256, 256, 0, stream1>>>(d A1, d B1, ...);
  cudaMemCpyAsync(d_C0, h_C+I; SegSize*sizeof(float),.., stream0);
  cudaMemCpyAsync(d C1, h C+i+SegSize;
                                        SegSize*sizeof(float),.., stream1);
```



}

#### A View Closer to Reality



BSC Barcelona Supercomputing Center Centro Nacional de Supercomputación

# **Overlapped (Pipelined) Timing**

- ( Divide large vectors into segments
- ( Overlap transfer and compute of adjacent segments





#### www.bsc.es



**Barcelona Supercomputing Center** Centro Nacional de Supercomputación

# Introduction to CUDA Programming Lecture 3: Tiled Matrix-Matrix Multiplication

#### **Transparent Scalability**

- Hardware is free to assigns blocks to any processor at any time
  - A kernel scales across any number of parallel processors



# **Example: Executing Thread Blocks**



- ( Threads are assigned to Streaming Multiprocessors in block granularity
  - Up to 8 blocks to each SM as resource allows
  - Fermi SM can take up to **1536** threads
    - Could be 256 (threads/block) \* 6 blocks
    - Or 512 (threads/block) \* 3 blocks, etc.







### The Von-Neumann Model with SIMD units





# **Example: Thread Scheduling**

- (C Each Block is executed as 32-thread Warps
- ( An implementation decision, not part of the CUDA programming model
- Warps are scheduling units in SM
- If 3 blocks are assigned to an SM and each block has 256 threads, how many Warps are there in an SM?
- (C Each Block is divided into 256/32 = 8 Warps
- ( There are 8 \* 3 = 24 Warps







### How thread blocks are partitioned

- ( Thread blocks are partitioned into warps
  - Thread IDs within a warp are consecutive and increasing
  - Warp 0 starts with Thread ID 0
- ( Partitioning is always the same
  - Thus you can use this knowledge in control flow
  - However, the exact size of warps may change from generation to generation
  - (Covered next)

# ( However, DO NOT rely on any ordering between warps

 If there are any dependencies between threads, you must \_\_syncthreads() to get correct results (more later).



# ( Example of control flow instruction: BRp #-4 if the condition is positive, jump back four instructions

# ( Instruction cycle for an arithmetic instruction: Fetch | Decode | Execute | Memory



# **Control Flow Instructions**

- ( Main performance concern with branching is divergence
  - Threads within a single warp take different paths
  - Different execution paths are serialized in current GPUs
- ( A common case: avoid divergence when branch condition is a function of thread ID
  - Example with divergence: If (threadIdx.x > 2) { }
    - This creates two different control paths for threads in a block
    - Branch granularity < warp size; threads 0, 1 and 2 follow different path than the rest of the threads in the first warp
  - Example without divergence: If (threadIdx.x / WARP\_SIZE > 2) { }
    - Also creates two different control paths for threads in a block
    - Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path



# Example: Thread Scheduling (Cont.)

- ( SM implements zero-overhead warp scheduling
  - At any time, 1 or 2 of the warps is executed by SM
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible Warps are selected for execution on a prioritized scheduling policy
  - All threads in a warp execute the same instruction when selected





# **Outline of Tiling Technique**

- ( Identify a block/tile of global memory content that are accessed by multiple threads
- ( Load the block/tile from global memory into on-chip memory
- ( Have the multiple threads to access their data from the onchip memory
- ( Move on to the next block/tile



# Idea: Use Shared Memory to reuse global memory data

- Each input element is read by WIDTH threads.
- Load each element into Shared Memory and have several threads use the local version to reduce the memory bandwidth
  - Tiled algorithms









# **Tiled Multiply**



0

2

by 1



DX

2



## Loading a Tile

#### ( All threads in a block participate

- Each thread loads one Md element and one Nd element in based tiled code
- ( Assign the loaded element to each thread such that the accesses within each warp is coalesced (more later).



#### Work for Block (0,0)



SM

| Maa                | M                  | $M_{\circ}$       | Maa                |     | M   |
|--------------------|--------------------|-------------------|--------------------|-----|-----|
| 0,0                | 1,0                | 2,0               | 3,0                | 0,0 | 1,0 |
| IVI <sub>0,1</sub> | IVI <sub>1,1</sub> | VI <sub>2,1</sub> | IVI <sub>3,1</sub> | 0,1 | 1,1 |
| M <sub>0,2</sub>   | M <sub>1,2</sub>   | M <sub>2,2</sub>  | $M_{3,2}$          |     |     |
| M <sub>0,3</sub>   | M <sub>1,3</sub>   | M <sub>2,3</sub>  | M <sub>3,3</sub>   |     |     |

| P <sub>0,0</sub> | P <sub>1,0</sub> | P <sub>2,0</sub> | P <sub>3,0</sub> |
|------------------|------------------|------------------|------------------|
| P <sub>0,1</sub> | P <sub>1,1</sub> | P <sub>2,1</sub> | P <sub>3,1</sub> |
| P <sub>0,2</sub> | P <sub>1,2</sub> | P <sub>2,2</sub> | P <sub>3,2</sub> |
| P <sub>0,3</sub> | P <sub>1,3</sub> | P <sub>2,3</sub> | P <sub>3,3</sub> |

SM



| N <sub>0,0</sub> | N <sub>1,0</sub> | N <sub>2,0</sub> | N <sub>3,0</sub> |
|------------------|------------------|------------------|------------------|
| N <sub>0,1</sub> | N <sub>1,1</sub> | N <sub>2,1</sub> | N <sub>3,1</sub> |
| N <sub>0,2</sub> | N <sub>1,2</sub> | N <sub>2,2</sub> | N <sub>3,2</sub> |
| N <sub>0,3</sub> | N <sub>1,3</sub> | N <sub>2,3</sub> | N <sub>3,3</sub> |

| M <sub>0,0</sub> | $M_{1,0}$        | M <sub>2,0</sub> | M <sub>3,0</sub> |
|------------------|------------------|------------------|------------------|
| M <sub>0,1</sub> | M <sub>1,1</sub> | M <sub>2,1</sub> | M <sub>3,1</sub> |
| M <sub>0,2</sub> | $M_{1,2}$        | M <sub>2,2</sub> | $M_{3,2}$        |
| M <sub>0,3</sub> | M <sub>1,3</sub> | M <sub>2,3</sub> | M <sub>3,3</sub> |











| N <sub>0,0</sub> | N <sub>1,0</sub> | N <sub>2,0</sub> | N <sub>3,0</sub> |
|------------------|------------------|------------------|------------------|
| N <sub>0,1</sub> | N <sub>1,1</sub> | N <sub>2,1</sub> | N <sub>3,1</sub> |
| N <sub>0,2</sub> | N <sub>1,2</sub> | N <sub>2,2</sub> | N <sub>3,2</sub> |
| N <sub>0,3</sub> | N <sub>1,3</sub> | N <sub>2,3</sub> | N <sub>3,3</sub> |

| M <sub>0,0</sub> | M <sub>1,0</sub> | M <sub>2,0</sub> | M <sub>3,0</sub> |
|------------------|------------------|------------------|------------------|
| M <sub>0,1</sub> | M <sub>1,1</sub> | M <sub>2,1</sub> | M <sub>3,1</sub> |
| M <sub>0,2</sub> | M <sub>1,2</sub> | M <sub>2,2</sub> | M <sub>3,2</sub> |
| M <sub>0,3</sub> | M <sub>1,3</sub> | M <sub>2,3</sub> | M <sub>3,3</sub> |





#### **Barrier Synchronization**

- ( An API function call in CUDA
  - \_\_synchthreads()
- ( Best used to coordinate tiled algorithms
  - To ensure that all elements of a tile are loaded
  - To ensure that all elements of a tile are consumed



## Loading an M Tile



## **Tiled Matrix Multiplication Kernel**

```
global void MatrixMul(float* d M, float* d N, float* d P, int Width)
1.
2.
      {
3.
       shared float ds M[TILE WIDTH][TILE WIDTH];
4.
        shared float ds N[TILE WIDTH][TILE WIDTH];
        int bx = blockIdx.x; int by = blockIdx.y;
5.
        int tx = threadIdx.x; int ty = threadIdx.y;
6.
        // Identify the row and column of the Pd element to work on
7.
        int Row = by * TILE WIDTH + ty;
8.
9.
        int Col = bx * TILE WIDTH + tx;
10.
        float Pvalue = 0;
11.
        // Loop over the Md and Nd tiles required to compute the Pd element
        for (int m = 0; m < Width/TILE WIDTH; ++m) {</pre>
12.
13.
          // Coolaborative loading of Md and Nd tiles into shared memory
          ds M[ty][tx] = d M[Row*Width + m*TILE WIDTH+tx];
14.
          ds N[ty][tx] = d N[Col+(m*TILE WIDTH+ty)*Width];
15.
16.
          syncthreads();
          for (int k = 0; k < TILE WIDTH; ++k)
17.
            Pvalue += ds M[ty][k] * ds_N[k][tx];
18.
          __synchthreads();
19.
20.
21.
        d P[Row*Width+Col] = Pvalue;
22.
      }
```



## Loading an N Tile

Upper left corner of N tile at step m: bx\*TILE\_WIDTH + m\*TILE\_WIDTH\*Width

Each thread uses ty and tx to load an element Upper left corner + ty \* Width + tx

= bx\*TILE\_WIDTH + m\*TILE\_WIDTH\*Width + ty \* Width + tx

= bx\*TILE\_WIDTH+tx + (m\*TILE\_WIDTH+ty)\* Width

0

2

by 1

tv

TILE WIDTH





by

k

m



#### **First-order Size Considerations**

- ( Each thread block should have many threads
  - TILE\_WIDTH of 16 gives 16\*16 = 256 threads
  - TILE\_WIDTH of 32 gives 32\*32 = 1024 threads
- ( For 16, each block performs 2\*256 = 512 float loads from global memory for 256 \* (2\*16) = 8,192 mul/add operations.
- ( For 32, each block performs 2\*1024 = 2048 float loads from global memory for 1024 \* (2\*32) = 65,536 mul/add operations



## **Shared Memory and Threading**

- ( Each SM in Fermi has 16KB or 48KB shared memory\*
  - SM size is implementation dependent!
     \*Configurable vs L1, total 64KE
  - For TILE\_WIDTH = 16, each thread block uses 2\*256\*4B = 2KB of shared memory.
  - Can potentially have up to 8 Thread Blocks actively executing
    - This allows up to 8\*512 = 4,096 pending loads. (2 per thread, 256 threads per block)
  - The next TILE\_WIDTH 32 would lead to 2\*32\*32\*4B= 8KB shared memory usage per thread block, allowing 2 or 6 thread blocks active at the same time
- ( Using 16x16 tiling, we reduce the accesses to the global memory by a factor of 16
  - The 150 GB/s bandwidth can now support: (150/4)\*16 = 600 GFLOPS!



## Summary-Typical Structure of a CUDA Program

- ( Global variables declaration
  - \_\_\_host\_\_\_
  - \_\_\_device\_\_\_... \_global\_\_\_, \_\_\_constant\_\_\_, \_\_texture\_\_\_
- ( Function prototypes
  - \_\_global\_\_\_void kernelOne(...)
  - float handyFunction(...)
- (( Main ()
  - allocate memory space on the device cudaMalloc(&d\_GlbIVarPtr, bytes)
  - transfer data from host to device cudaMemCpy(d\_GlbIVarPtr, h\_Gl..)
  - execution configuration setup
  - kernel call kernelOne<<<execution configuration>>>( args... );
  - transfer results from device to host cudaMemCpy(h\_GlblVarPtr,...)
  - optional: compare against golden (host computed) solution
- (Kernel void kernelOne(type args,...)
  - variables declaration auto, \_\_\_shared\_\_\_
    - automatic variables transparently assigned to registers
  - syncthreads()...
- ( Other functions
  - float handyFunction(int inVar...);





#### www.bsc.es



**Barcelona Supercomputing Center** Centro Nacional de Supercomputación

## Introduction to CUDA Programming Lecture 4: Convolution, Constant Memory and Caching

#### 2D Convolution – Inside Cells



BSC Barcelona Supercomputing Center Centro Nacional de Supercomputación

#### 2D Convolution – Halo Cells





#### Access Pattern for M

( M is referred to as mask (a.k.a. kernel, filter, etc.)

- Elements of M are called mask (kernel, filter) coefficients
- (Calculation of all output P elements need M
- ( M is not changed during kernel
- ( Bonus M elements are accessed in the same order when calculating all P elements
- ( M is a good candidate for Constant Memory



# Programmer View of CUDA Memories

#### ( Each thread can:

- Read/write per-thread
   registers (~1 cycle)
- Read/write per-block
   shared memory (~5
   cycles)
- Read/write per-grid global memory (~500 cycles)
- Read/only per-grid constant memory (~5 cycles with caching)





#### **Memory Hierarchies**

- (( If every time we needed a piece of data, we had to go to main memory to get it, computers would take a lot longer to do anything
- ( On today's processors, main memory accesses take hundreds of cycles
- ( One solution: Caches



#### Cache - Cont'd

( In order to keep cache fast, it needs to be small, so we cannot fit the entire data set in it





#### Cache - Cont'd

- (Cache is unit of volatile memory storage
- ( A cache is an "array" of cache lines
- (Cache line can usually hold data from several consecutive memory addresses
- ( When data is requested from memory, an entire cache line is loaded into the cache, in an attempt to reduce main memory requests



#### Caches - Cont'd

#### Some definitions:

- Spatial locality: is when the data elements stored in consecutive memory locations are access consecutively
- Temporal locality: is when the same data element is access multiple times in short period of time
- ( Both spatial locality and temporal locality improve the performance of caches



- (C Scratchpad (shared memory in CUDA) is another type of temporary storage used to relieve main memory contention.
- ( In terms of distance from the processor, scratchpad is similar to L1 cache.
- ( Unlike cache, scratchpad does not necessarily hold a copy of data that is in main memory
- ((It requires explicit data transfer instructions, whereas cache doesn't



## **Cache Coherence Protocol**

( A mechanism for caches to propagate updates by their local processor to other caches (processors)



## Main Memory



## CPU and GPU have different caching philosophy

#### (CPU L1 caches are usually coherent

- L1 is also replicated for each core
- Even data that will be changed can be cached in L1
- Updates to local cache copy invalidates (or less commonly updates) copies in other caches
- Expensive in terms of hardware and disruption of services (cleaning bathrooms at airports..)

#### ( GPU L1 caches are usually incoherent

- Avoid caching data that will be modified



#### How to Use Constant Memory

- ( Host code allocates, initializes variables the same way as any other variables that need o be copied to the device
- ( Use cudaMemcpyToSymbol(dest, src, size) to copy the variable into the device memory
- ( This copy function tells the device that the variable will not be modified by the kernel and can be safely cached.



## More on Constant Caching

- (C Each SM has its own L1
   cache
  - Low latency, high bandwidth access by all threads
- ( However, there is no way for threads in one SM to update the L1 cache in other SMs
  - No L1 cache coherence



# This is not a problem if a variable is NOT modified by a kernel.



#### #define KERNEL\_SIZE 5

// Matrix Structure declaration
typedef struct {
 unsigned int width;
 unsigned int height;
 unsigned int pitch;
 float\* elements;
} Matrix;



## AllocateMatrix()

// Allocate a device matrix of dimensions height\*width
// If init == 0, initialize to all zeroes.
// If init == 1, perform random initialization.
// If init == 2, initialize matrix parameters, but do
// not allocate memory

```
Matrix AllocateMatrix(int height, int width, int init)
{
```

```
Matrix M;
M.width = M.pitch = width;
M.height = height;
int size = M.width * M.height;
M.elements = NULL;
```



## AllocateMatrix() (Cont.)

```
// Don't allocate memory on option 2
if(init == 2) return M;
M.elements = (float*) malloc(size*sizeof(float));
```

```
for(unsigned int i = 0; i < M.height * M.width; i++)
{
    M.elements[i] = (init == 0) ? (0.0f) :
        (rand() / (float)RAND_MAX);
    if(rand() % 2)    M.elements[i] = - M.elements[i]
}</pre>
```

#### return M;



#### Host Code

...

....

- // global variable, outside any function
   \_\_constant\_\_ float Mc[KERNEL\_SIZE][KERNEL\_SIZE];
  - // allocate N, P, initialize N elements, copy N to Nd
    Matrix M;
  - M = AllocateMatrix(KERNEL\_SIZE, KERNEL\_SIZE, 1);
  - // initialize M elements



## Tiling P

#### ( Use a thread block to calculate a tile of P

Thread Block size determined by the TILE\_SIZE



## Tiling N

(( Each N element is used in calculating up to KERNEL\_SIZE \* KERNEL\_SIZE P elements (all elements in the tile)





## High-Level Tiling Strategy

- ( Load a tile of N into shared memory (SM)
  - All threads participate in loading
  - A subset of threads then use each N element in SM





#### Input tiles need to be larger than output tiles.





- ( Use a thread block that matches input tile
- ( Each thread loads one element of the input tile
- ( Some threads do not participate in calculating output
- ( There will be if statements and control divergence



## Shifting from output coordinates to input coordinates





#### Shifting from output coordinates to input coordinate

- int tx = threadIdx.x;
- int ty = threadIdx.y;
- int row\_o = blockIdx.y \* TILE\_SIZE + ty;
- int col\_o = blockIdx.x \* TILE\_SIZE + tx;
- int row\_i = row\_o 2;
- int col\_i = col\_o 2;



#### Threads that loads halos outside N should return 0.0





```
float output = 0.0f;
```

```
if((row_i >= 0) && (row_i < N.height) &&
    (col_i >= 0) && (col_i < N.width) ) {
    Ns[ty][tx] = N.elements[row_i*N.width + col_i];
}
else{
    Ns[ty][tx] = 0.0f;
}</pre>
```



#### Some threads do not participate in calculating output

```
if(ty < TILE_SIZE && tx < TILE_SIZE){
   for(i = 0; i < 5; i++) {
     for(j = 0; j < 5; j++) {
        output += Mc[i][j] * Ns[i+ty][j+tx];
     }
}</pre>
```



#### Some threads do not write output

# if(row\_o < P.height && col\_o < P.width) P.elements[row\_o \* P.width + col\_o] = output;</pre>



#### #define BLOCK\_SIZE (TILE\_SIZE + 4)

#### dim3 dimBlock(BLOCK\_SIZE,BLOCK\_SIZE);

((In general, block size should be tile size + (kernel size -1)



#### **Tiling Benefit Analysis**

- ( Start with KERNEL\_SIZE = 5
- ( Each point in an input tile is used multiple times.
  - Each boundary point (blue) is used
     9 times
  - Each second boundary point (yellow) is used 16 times
  - Each inner boundary point (red) is used 25 times





#### **Reuse Analysis**

#### ( For TILE\_SIZE = 12

- 44 boundary points
- 36 boundary points
- 64 inside points
- Total uses 44\*9 + 36\*16 + 64\*25 = 396+576+1600 = 2572
- Average reuse = 2572/144 = 17.9

#### ( As TILE\_SIZE increases, the average reuse approach 25



#### In General

- ( The number of boundary layers is proportional to the KERNEL\_SIZE
- ( The maximal reuse of each data point is (KERNEL\_SIZE)<sup>2</sup>
- ( BLOCK\_SIZE is limited by the maximal number of threads in a thread block
- (( Input tile sizes could be could be N\*TILE\_SIZE +
   (KERNEL\_SIZE-1)
  - By having each thread to calculate N input points (thread coarsening)
  - N is limited is limited by the shared memory size
- ( KERNEL\_SIZE is decided by application needs



#### www.bsc.es



**Barcelona Supercomputing Center** Centro Nacional de Supercomputación

## **Applied CUDA Programming**

Lecture 5: Reductions

#### **Partition and Summarize**

- ( A commonly used strategy for processing large input data sets
  - There is no required order of processing elements in a data set (associative and commutative)
  - Partition the data set into smaller chunks
  - Have each thread to process a chunk
  - Use a reduction tree to summarize the results from each chunk into the final answer
- ( We will focus on the reduction tree step for now.
- ( Google and Hadoop MapReduce frameworks are examples of this pattern



#### Reduction enables other techniques

- ( Reduction is also needed to clean up after some commonly used parallelizing transformations
- ( Privatization
  - Multiple threads write into an output location
  - Replicate the output location so that each thread has a private output location
  - Use a reduction tree to combine the values of private locations into the original output location



#### What is a reduction computation

- ( Summarize a set of input values into one value using a "reduction operation"
  - Max
  - Min
  - Sum
  - Product
  - Often with user defined reduction operation function as long as the operation
    - Is associative and commutative
    - Has a well-defined identity value (e.g., 0 for sum)



### A sequential reduction algorithm performs N operations

- ( Initialize the result as an identity value for the reduction operation
  - Smallest possible value for max reduction
  - Largest possible value for min reduction
  - 0 for sum reduction
  - 1 for product reduction
- ( Scan through the input and perform the reduction operation between the result value and the current input value



#### A parallel reduction performs N-1 Operations in log(N) steps





#### A Quick Analysis

- ( For N input values, the reduction tree performs
  - (1/2)N + (1/4)N + (1/8)N + ... (1/N) = (1 (1/N))N = N-1 operations
  - In Log (N) steps 1,000,000 input values take 20 steps
    - Assuming that we have enough execution resources
  - Average Parallelism (N-1)/Log(N))
    - For N = 1,000,000, average parallelism is 50,000
    - However, peak resource requirement is 500,000!
- ( This is a work-efficient parallel algorithm
  - The amount of work done is comparable to sequential
  - Many parallel algorithms are not work efficient



#### A Sum Reduction Example

- ( Parallel implementation:
  - Recursively halve # of threads, add two values per thread in each step
  - Takes log(n) steps for n elements, requires n/2 threads
- ( Assume an in-place reduction using shared memory
  - The original vector is in device global memory
  - The shared memory is used to hold a partial sum vector
  - Each step brings the partial sum vector closer to the sum
  - The final sum will be in element 0
  - Reduces global memory traffic due to partial sum values







#### A Sum Example



Active Partial Sum elements



#### Simple Thread Index to Data Mapping

- ( Each thread is responsible of an even-index location of the partial sum vector
  - One input is the location of responsibility
- ( After each step, half of the threads are no longer needed
- ( In each step, one of the inputs comes from an increasing distance away



#### A Simple Thread Block Design

(( Each thread block takes 2\* BlockDim input elements
 (( Each thread loads 2 elements into shared memory
 \_\_\_\_shared\_\_\_float partialSum[2\*BLOCK\_SIZE];

```
unsigned int t = threadIdx.x;
unsigned int start = 2*blockIdx.x*blockDim.x;
partialSum[t] =
    input[start + t];
partialSum[blockDim+t] =
    input[start+ blockDim.x+t];
```



#### **The Reduction Steps**

#### Why do we need **syncthreads()**?



#### **Barrier Synchronization**

- (( syncthreads() are needed to ensure that all elements of each version of partial sums have been generated before we proceed to the next step
- ( Why do we not need another syncthread() at the end of the reduction loop?



#### Back to the Global Picture

- ( Thread 0 in each thread block write the sum of the thread block in partialSum[0] into a vector indexed by the blockIdx.x
- ( There can be a large number of such sums if the original vector is very large
  - The host code may iterate and launch another kernel
- ( If there are only a small number of sums, the host can simply transfer the data back and add them together.



#### Some Observations

- ( In each iteration, two control flow paths will be sequentially traversed for each warp
  - Threads that perform addition and threads that do not
  - Threads that do not perform addition still consume execution resources
- ( No more than half of threads will be executing after the first step
  - All odd index threads are disabled after first step
  - After the 5<sup>th</sup> step, entire warps in each block will fail the if test, poor resource utilization but no divergence.
    - This can go on for a while, up to 5 more steps (1024/32=16= 2<sup>5</sup>), where each active warp only has one productive thread until all warps in a block retire



#### **Thread Index Usage Matters**

- ( In some algorithms, one can shift the index usage to improve the divergence behavior
  - Commutative and associative operators
- ( Example given an array of values, "reduce" them to a single value in parallel
  - Sum reduction: sum of all values in the array
  - Max reduction: maximum of all values in the array



. . .

#### A Better Strategy

( Always compact the partial sums into the first locations in the partialSum[] array

( Keep the active threads consecutive



#### An Example of 16 threads

Thread 0 Thread 1 Thread 2

Thread 14Thread 15





#### **A Better Reduction Kernel**



#### A Quick Analysis

#### ( For a 1024 thread block

- No divergence in the first 5 steps
- 1024, 512, 256, 128, 64, 32 consecutive threads are active in each step
- The final 5 steps will still have divergence



#### www.bsc.es



**Barcelona Supercomputing Center** Centro Nacional de Supercomputación

## Introduction to CUDA Programming Lecture 9: MPI and CUDA Programming

#### Outline

#### ( MPI for dummies

#### ( MPI meets CUDA

#### ( MPI and CUDA Example: 3D Stencil

#### ( MPI and CUDA 4.0



#### **Message Passing Interface**

- ( MPI is a standard message passing API
- ( Oriented to cluster machines
  - Distributed memory
  - Hides underlying interconnection network
- ( Processes execute on different nodes of a network



#### **MPI** Model

#### ( Many processes distributed in a cluster



- ( Each process computes part of the output
- ( Processes communicate with each other
- ( Processes can synchronize



### **MPI Message Types**

- ( Point-to-point communication
  - Send and Receive
- (Collective communication
  - Barrier
  - Broadcast
  - Reduce
  - Gather and Scatter



#### MPI Initialization, Info and Sync

- (( int MPI\_Init(int \*argc, char \*\*\*argv))
  - Initialize MPI
- ( MPI\_COMM\_WORLD
  - MPI group with all allocated nodes
- (( int MPI\_Comm\_rank (MPI\_Comm comm, int \*rank)
  - Rank of the calling process in group of comm
- (( int MPI\_Comm\_size (MPI\_Comm comm, int \*size)
  - Number of processes in the group of comm
- (( int MPI\_Barrier (MPI\_Comm comm)
  - Blocks the caller until all group members have called it; the call returns at any process only after all group members have entered the call.



#### **MPI Sending Data**

- (( int MPI\_Send(void \*buf, int count, MPI\_Datatype
   datatype, int dest, int tag, MPI\_Comm comm)
  - Buf: Initial address of send buffer (choice)
  - Count: Number of elements in send buffer (nonnegative integer)
  - Datatype: Datatype of each send buffer element (handle)
  - Dest: Rank of destination (integer)
  - Tag: Message tag (integer)
  - Comm: Communicator (handle)

#### ( DATA\_DISTRIBUTE: Send to all nodes



### **MPI Receiving Data**

- (( int MPI\_Recv(void \*buf, int count, MPI\_Datatype datatype, int source, int tag, MPI\_Comm comm, MPI\_Status \*status)
  - Buf: Initial address of receive buffer (choice)
  - Count: Maximum number of elements in receive buffer (integer)
  - Datatype: Datatype of each receive buffer element (handle)
  - Source: Rank of source (integer)
  - Tag: Message tag (integer)
  - Comm: Communicator (handle)
  - Status: Status object (Status)



# MPI Sending and Receiving Data

- (( int MPI\_Sendrecv(void \*sendbuf, int sendcount, MPI\_Datatype sendtype, int dest, int sendtag, void \*recvbuf, int recvcount, MPI\_Datatype recvtype, int source, int recvtag, MPI\_Comm comm, MPI\_Status \*status)
  - Sendbuf:Initial address of send buffer (choice)
  - Sendcount: Number of elements in send buffer (integer)
  - Sendtype: Type of elements in send buffer (handle)
  - Dest: Rank of destination (integer)
  - Sendtag: Send tag (integer)
  - Recvcount: Number of elements in receive buffer (integer)
  - Recvtype: Type of elements in receive buffer (handle)
  - Source: Rank of source (integer)
  - Recvtag: Receive tag (integer)
  - Comm: Communicator (handle)
  - Recvbuf: Initial address of receive buffer (choice)
  - Status: Status object (Status). This refers to the receive operation.





#### ( MPI for dummies

# ( MPI meets CUDA

# ( MPI and CUDA Example: 3D Stencil

#### ( MPI and CUDA 4.0



#### **CUDA-based cluster**

#### ( Each node contains NGPUs



# **CUDA and MPI Communication**

#### ( Source MPI process:

- cudaMemcpy(tmp,src, cudaMemcpyDeviceToHost)
- MPI\_Send()

#### ( Destination MPI process:

- MPI\_Recv()
- cudaMemcpy(dst, src, cudaMemcpyDeviceToDevice)







#### ( MPI for dummies

( MPI meets CUDA

( MPI and CUDA Example: 3D Stencil

( MPI and CUDA 4.0



```
int main(int argc, char *argv[]) {
    int pad = 0, dimx = 480+pad, dimy = 480, dimz = 400, nreps = 100;
    int pid=-1, np=-1;
    MPI Init(&argc, &argv);
    MPI Comm rank(MPI COMM WORLD, &pid);
    MPI Comm size(MPI COMM WORLD, &np);
    if(np < 3) {
        if(0 == pid) printf("Nedded 3 or more processes.\n");
        MPI Abort( MPI COMM WORLD, 1 ); return 1;
    }
    if(pid < np - 1)
        compute_node_stencil(dimx, dimy, dimz / (np - 1), nreps);
    else
        data server( dimx,dimy,dimz, nreps );
    MPI Finalize();
    return 0;
```

( Volumes are split into tiles (along the Z-axis)

- 3D-Stencil introduces data dependencies





```
void data server(int dimx, int dimy, int dimz, int nreps) {
    int np, num comp nodes = np - 1, first node = 0, last node = np - 2;
    unsigned int num points = dimx * dimy * dimz;
    unsigned int num bytes = num points * sizeof(float);
    float *input=0, *output=0;
    /* Set MPI Communication Size */
    MPI Comm size(MPI COMM WORLD, &np);
    /* Allocate input data */
    input = (float *)malloc(num bytes);
    output = (float *)malloc(num bytes);
    if(input == NULL || output == NULL) {
        printf("server couldn't allocate memory\n");
        MPI Abort( MPI COMM WORLD, 1 );
    }
    /* Initialize input data */
    random_data(input, dimx, dimy , dimz , 1, 10);
    /* Calculate number of shared points */
    int edge num points = dimx * dimy * (dimz / num comp nodes + 4);
    int int num points = dimx * dimy * (dimz / num comp nodes + 8);
    float *send address = input;
```



```
/* Send data to the first compute node */
MPI Send(send address, edge num points, MPI REAL, first node,
        DATA DISTRIBUTE, MPI COMM WORLD );
send address += dimx * dimy * (dimz / num comp nodes - 4);
/* Send data to "internal" compute nodes */
for(int process = 1; process < last node; process++) {</pre>
    MPI Send(send address, int num points, MPI REAL, process,
            DATA DISTRIBUTE, MPI COMM WORLD);
    send address += dimx * dimy * (dimz / num comp nodes);
}
/* Send data to the last compute node */
MPI_Send(send_address, edge_num_points, MPI_REAL, last_node,
        DATA DISTRIBUTE, MPI COMM WORLD);
```



```
/* Wait for nodes to compute */
MPI Barrier(MPI COMM WORLD);
/* Collect output data */
MPI Status status;
for(int process = 0; process < num comp nodes; process++)</pre>
    MPI Recv(output + process * num points / num comp nodes,
        num_points / num_comp_nodes, MPI_REAL, process,
        DATA COLLECT, MPI COMM WORLD, &status );
/* Store output data */
store output(output, dimx, dimy, dimz);
/* Release resources */
free(input);
free(output);
```



( Approach: two-stage execution

- Stage 1: compute the field points to be exchanged





# Boundary Exchange Example (II)

#### ( Approach: two-stage execution

Stage 2: Compute the remaining points *while* exchanging the boundaries





## Stencil Code: Compute Process (I)

```
void compute_node_stencil(int dimx, int dimy, int dimz, int nreps ) {
    int np, pid;
    MPI Comm rank(MPI COMM WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    unsigned int num_points = dimx * dimy * (dimz + 8);
unsigned int num_bytes = num_points * sizeof(float);
    unsigned int num ghost points = 4 * dimx * dimy;
    unsigned int num ghost bytes = num ghost points * sizeof(float);
    int left ghost offset = 0;
    int right ghost offset = dimx * dimy * (4 + dimz);
    int left stage1 offset = 0;
    int right_stage1_offset = dimx * dimy * (dimz - 4);
    int stage2 offset = num ghost points;
```



```
float *h input = NULL, *h output = NULL;
float *d input = NULL, *d output = NULL, *d vsq = NULL;
float *h_left_ghost_own = NULL, *h_right_ghost_own = NULL;
float *h left ghost = NULL, *h right ghost = NULL;
/* Alloc host memory */
h input = (float *)malloc(num bytes);
h_output = (float *)malloc(num bytes);
/* Alloc host memory for ghost data */
cudaMallocHost((void **)&h left ghost own, num ghost bytes );
cudaMallocHost((void **)&h right ghost own, num ghost bytes );
cudaMallocHost((void **)&h_left_ghost,
                                           num ghost bytes );
cudaMallocHost((void **)&h right ghost,
                                           num ghost bytes );
/* Alloca device memory for input and output data */
cudaMalloc((void **)&d input, num bytes );
cudaMalloc((void **)&d output, num bytes );
```



```
MPI Status status;
int left_neighbor = (pid > 0) ? (pid - 1) : MPI_PROC_NULL;
int right_neighbor = (pid < np - 2) ? (pid + 1) : MPI_PROC_NULL;</pre>
int server process = np - 1;
/* Get the input data from main process */
float *rcv address = h input + num ghost points * (0 == pid);
MPI Recv(rcv_address, num_points, MPI_REAL, server_process,
        DATA DISTRIBUTE, MPI COMM WORLD, &status );
cudaMemcpy(d_input, h_input, num_bytes, cudaMemcpyHostToDevice );
/* Upload stencil cofficients */
upload coefficients(coeff, 5);
/* Create streams used for stencil computation */
cudaStream t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
```



# Stencil Code: Compute Process (IV)

```
MPI Barrier( MPI COMM WORLD );
for(int i=0; I < nreps; i++) {</pre>
    /* Compute values needed by other nodes first */
    launch kernel(d output + left stage1 offset,
        d input + left stage1 offset, dimx, dimy, 12, stream1);
    launch kernel(d output + right stage1 offset,
        d input + right stage1 offset, dimx, dimy, 12, stream1);
    /* Compute the remaining points */
    launch kernel(d output + stage2_offset, d_input + stage2_offset,
                dimx, dimy, dimz, stream2);
    /* Copy the data needed by other nodes to the host */
    cudaMemcpyAsync(h left ghost own,
            d output + num ghost points,
            num ghost bytes, cudaMemcpyDeviceToHost, stream1 );
    cudaMemcpyAsync(h_right_ghost_own,
                d output + right stage1 offset + num ghost points,
                num_ghost_bytes, cudaMemcpyDeviceToHost, stream1 );
    cudaStreamSynchronize(stream1);
```



## Stencil Code: Compute Process (V)

```
/* Send data to left, get data from right */
MPI_Sendrecv(h_left_ghost_own, num_ghost_points, MPI_REAL,
            left_neighbor, i, h_right_ghost,
            num ghost points, MPI REAL, right neighbor, i,
            MPI COMM WORLD, &status );
/* Send data to right, get data from left */
MPI_Sendrecv(h_right_ghost_own, num_ghost_points, MPI_REAL,
            right neighbor, i, h left ghost,
            num ghost points, MPI REAL, left neighbor, i,
            MPI COMM WORLD, &status );
cudaMemcpyAsync(d output+left ghost offset, h left ghost,
            num ghost bytes, cudaMemcpyHostToDevice, stream1);
cudaMemcpyAsync(d output+right ghost offset, h right ghost,
            num_ghost_bytes, cudaMemcpyHostToDevice, stream1 );
cudaDeviceSynchronize();
float *temp = d output;
d output = d input; d input = temp;
```

Barcelona Supercomputing Center Centro Nacional de Supercomputación

}

```
/* Wait for previous communications */
MPI Barrier(MPI COMM WORLD);
float *temp = d output;
d output = d input;
d input = temp;
/* Send the output, skipping ghost points */
cudaMemcpy(h output, d output, num bytes, cudaMemcpyDeviceToHost);
float *send_address = h_output + num_ghost points;
MPI Send(send address, dimx * dimy * dimz, MPI REAL,
        server_process, DATA_COLLECT, MPI_COMM_WORLD);
MPI Barrier(MPI COMM WORLD);
/* Release resources */
free(h input); free(h output);
cudaFreeHost(h_left_ghost_own); cudaFreeHost(h_right_ghost_own);
cudaFreeHost(h_left_ghost); cudaFreeHost(h_right_ghost);
cudaFree( d input ); cudaFree( d output );
```





#### ( MPI for dummies

### ( MPI meets CUDA

# ( MPI and CUDA Example: 3D Stencil

( MPI and CUDA 4.0



# Without GPU Direct

( There is an internal copy (not seen by the user) between CUDA buffers and Infinibad buffers





(( There is no internal copy, increasing performance(( The program code remains unchanged





#### ( MPI Processes handle more than one GPU



#### ( Peer GPU to GPU communication without need for MPI



( MPI Processes handle more than one GPU using several CPU (host) threads





#### www.bsc.es



Barcelona Supercomputing Center Centro Nacional de Supercomputación

# Applied CUDA Programming

Lecture 6: Prefix Scan

# Objective

- ( To master parallel Prefix Sum (Scan) algorithms
  - frequently used for parallel work assignment and resource allocation
  - A key primitive to in many parallel algorithms to covert serial computation into parallel computation
  - Based on reduction tree and reverse reduction tree
- ( Reading Mark Harris, Parallel Prefix Sum with CUDA
  - http://developer.download.nvidia.com/compute/cuda/1\_1/Website/proj ects/scan/doc/scan.pdf



# (Inclusive) Prefix-Sum (Scan) Definition

**Definition:** The all-prefix-sums operation takes a binary associative operator  $\bigoplus$ , and an array of n elements

 $[x_0, x_1, \ldots, x_{n-1}],$ 

and returns the array

 $[x_0, (x_0 \oplus x_1), \dots, (x_0 \oplus x_1 \oplus \dots \oplus x_{n-1})].$ 

**Example:** If  $\oplus$  is addition, then the all-prefix-sums operation on the array [3 1 7 0 4 1 6 3], would return [3 4 11 11 15 16 22 25].



# A Inclusive Scan Application Example

- ( Assume that we have a 100-inch sausage to feed 10
- ( We know how much each person wants in inches
  - [3 5 2 7 284 30 8 1]
- ( How do we cut the sausage quickly?
- ( How much will be left
- ( Method 1: cut the sections sequentially: 3 inches first, 5 inches second, 2 inches third, etc.
- ( Method 2: calculate Prefix scan
  - [3, 8, 10, 17, 45, 49, 52, 52, 60, 61] (39 inches left)



# **Other Applications**

- ( Assigning camp slots
- ( Assigning farmer market space
- ( Allocating memory to parallel threads
- ( Allocating memory buffer for communication channels

(( ...



# **A Inclusive Sequential Prefix-Sum**

Given a sequence  $[x_0, x_1, x_2, \dots]$ Calculate output

 $[y_0, y_1, y_2, \dots]$ 

Such that

$$y_0 = x_0$$
  
 $y_1 = x_0 + x_1$   
 $y_2 = x_0 + x_1 + x_2$ 

Using a recursive definition

$$y_i = y_{i-1} + x_i$$

. . .



# A Work Efficient C Implementation

y[0] = x[0]; for (i = 1; i < Max\_i; i++) y[i] = y [i-1] + x[i];</pre>

- (Computationally efficient:
  - N additions needed for N elements O(N)!



# A Naïve Inclusive Parallel Scan

- ( Assign one thread to calculate each y element
- ( Have every thread to add up all x elements needed for the y element

$$y_0 = x_0$$
  
 $y_1 = x_0 + x_1$   
 $y_2 = x_0 + x_1 + x_2$ 

"Parallel programming is easy as long as you do not care about performance."



#### Let's Look at the Reduction Tree Again





#### **Reduction Scan Step**





#### **Inclusive Post Scan Step**





### **Inclusive Post Scan Step**





## Putting All Together





```
/* scan_array[BLOCK_SIZE] is in shared memory */
int stride = 1;
while(stride < BLOCK_SIZE)
{
    int index = (threadIdx.x+1)*stride*2 - 1;
    if(index < BLOCK_SIZE)
        scan_array[index] += scan_array[index-stride];
    stride = stride*2;
    __syncthreads();
    threadIdx.x+1 = 1, 2, 3, 4....
}
</pre>
```



## Putting All Together





```
int stride = BLOCK_SIZE >> 1;
while(stride > 0)
{
    int index = (threadIdx.x+1)*stride*2 - 1;
    if(index < BLOCK_SIZE) {
        scan_array[index+stride] += scan_array[index];
    }
    stride = stride >> 1;
    __syncthreads();
}
```



## (Exclusive) Prefix-Sum (Scan) Definition

**Definition:** The all-prefix-sums operation takes a binary associative operator  $\bigoplus$ , and an array of n elements [a0, a1, ..., an-1],

and returns the array

 $[0, a0, (a0 \oplus a1), ..., (a0 \oplus a1 \oplus ... \oplus an-2)].$ 

**Example:** If  $\oplus$  is addition, then the all-prefix-sums operation on the array [3 1 7 0 4 1 6 3], would return [0 3 4 11 11 15 16 22].



- ( To find the beginning address of allocated buffers
- ( Inclusive and Exclusive scans can be easily derived from each other; it is a matter of convenience



#### **Exclusive Post Scan Step**





#### **Exclusive Post Scan Step**





#### **Inclusive Post Scan Step**





## Exclusive Scan Example – Reduction Step

| T 3 1 | 7 | 0 | 4 | 1 | 6 | 3 |
|-------|---|---|---|---|---|---|
|-------|---|---|---|---|---|---|

Assume array is already in shared memory



#### Reduction Step (cont.)



Iteration 1, n/2 threads

Each  $\bigoplus$  corresponds to a single thread.

Iterate log(n) times. Each thread adds value *stride* elements away to its own value



#### Reduction Step (cont.)



Iteration 2, n/4 threads

Each corresponds to a single thread.

Iterate log(n) times. Each thread adds value stride elements away to its own value



## Reduction Step (cont.)



Iterate log(n) times. Each thread adds value *stride* elements away to its own value. Note that this algorithm operates in-place: no need for double buffering





We now have an array of partial sums. Since this is an exclusive scan, set the last element to zero. It will propagate back to the first element.



#### Post Scan Step from Partial Sums





#### Post Scan Step from Partial Sums (cont.)



Each corresponds to a single thread.

Iterate log(n) times. Each thread adds value *stride* elements away to its own value, and sets the value *stride* elements away to its own *previous* value.



#### Post Scan From Partial Sums (cont.)



Each corresponds to a single thread.

Iterate log(n) times. Each thread adds value *stride* elements away to its own value, and sets the value *stride* elements away to its own *previous* value.



# Post Scan Step From Partial Sums (cont.)



Done! We now have a completed scan that we can write out to device memory.

Total steps:  $2 * \log(n)$ . Total work: 2 \* (n-1) adds = O(n) Work Efficient!



# Work Analysis

- ( The parallel Inclusive Scan executes 2\* log(n) parallel iterations
  - log(n) in reduction and log(n) in post scan
  - The iterations do n/2, n/4,..1, 1, ...., n/4. n/2 adds
  - − Total adds:  $2^*$  (n-1)  $\rightarrow$  O(n) work
- ( The total number of adds is no more than twice of that done in the efficient sequential algorithm
  - The benefit of parallelism can easily overcome the 2X work





Each thread reads one value from the input array in device memory into shared memory array T0. Thread 0 writes 0 into shared memory array.  Read input from device memory to shared memory. Set first element to zero and shift others right by one.





1. (previous slide)

 Iterate log(n) times: Threads stride to n: Add pairs of elements stride elements apart. Double stride at each iteration. (note must double buffer shared mem arrays)

Iteration #1 Stride = 1 Active threads: *stride* to *n*-1 (*n*-*stride* threads)
Thread *j* adds elements *j* and *j*-*stride* from T0 and writes result into shared memory buffer T1 (ping-pong)





- Read input from device memory to shared memory. Set first element to zero and shift others right by one.
- Iterate log(n) times: Threads stride to n: Add pairs of elements stride elements apart. Double stride at each iteration. (note must double buffer shared mem arrays)







Iteration #3 Stride = 4



 Read input from device memory to shared memory. Set first element to zero and shift others right by one.

2. Iterate log(n) times: Threads stride to n: Add pairs of elements stride elements apart. Double stride at each iteration. (note must double buffer shared mem arrays)





- Read input from device memory to shared memory. Set first element to zero and shift others right by one.
- Iterate log(n) times: Threads stride to n: Add pairs of elements stride elements apart. Double stride at each iteration. (note must double buffer shared mem arrays)
- 3. Write output to device memory.

#### Work Efficiency Considerations

- ( The plausible parallel Scan executes log(n) parallel iterations
  - The steps do (n-1), (n-2), (n-4),..(n-n/2) adds
  - Total adds:  $n * log(n) + (n-1) \rightarrow O(n*log(n))$  work
- ( This scan algorithm is not very work efficient
  - Sequential scan algorithm does *n* adds
  - A factor of log(n) hurts: 20x for 10^6 elements!
- ( A parallel algorithm can be slow when execution resources are saturated due to low work efficiency



# Working on Arbitrary Length Input

- ( Build on the scan kernel that handles up to 2\*blockDim elements
- ( Have each section of 2\*blockDim elements assigned to each block
- ( Have each block write the sum of its section into a Sum array indexed by blockIdx.x
- ( Run parallel scan on the Sum array
  - May need to break down Sum into multiple sections if it is too big for a block
- ( Add the scanned Sum array values to the elements of corresponding sections



# **Overall Flow of Complete Scan**





# INTELLECTUAL PROPERTY RIGHTS NOTICE:

- The User may only download, make and retain a copy of the materials for his/her use for non-commercial and research purposes.
- The User may not commercially use the material, unless has been granted prior written consent by the Licensor to do so; and cannot remove, obscure or modify copyright notices, text acknowledging or other means of identification or disclaimers as they appear.
- For further details, please contact BSC-CNS patc@bsc.es

**(#**)

PATC training, Barcelona, May 2012