Sequential algorithms specify a sequence of steps that each perform one operation, while parallel algorithms perform multiple operations per step. There are two core types of problems solved by parallel algorithms: transformations & reductions. The typical approach to a transformation problem is to assign one thread per output, since these problems generally have output size equal to input size. Reduction problems are fundamentally different & generally harder to solve because they reduce the size of the input to produce the output.
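For example, a minimal sketch of the one-thread-per-output transformation pattern (the kernel name & the squaring operation here are illustrative, not from any particular library):

```cuda
// One thread per output element: thread idx computes exactly out[idx].
__global__ void square(const float *in, float *out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                      // guard threads past the end of the array
        out[idx] = in[idx] * in[idx];
}
```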
Reduction problems run into the scenario in which many threads need to modify the same memory location, & this cannot happen safely at the same time. Nvidia GPUs have many instructions that facilitate atomic operations. These instructions are limited to certain data types, e.g. int, unsigned, float, etc… Atomic operations are serviced by a special mechanism in the L2 cache. They are read-modify-write (RMW) operations & return the old value that was present before the atomic update, e.g. the sketch below.
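A minimal sketch of an atomic RMW, assuming a device accumulator at *sum that has been zero-initialized before launch (the kernel name is illustrative):

```cuda
// atomicAdd reads the value at sum, adds val, writes the result back,
// & returns the value that was there before the update.
__global__ void atomicSum(const float *in, float *sum, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float old = atomicAdd(sum, in[idx]); // old holds the pre-update value
        (void)old;                           // often unused; shown for illustration
    }
}
```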
The most basic approach for parallel reductions is a tree-based method. Accomplishing this requires global synchronization, i.e. synchronization across the entire grid. One way to achieve this is to take advantage of CUDA's default behavior between kernel launches: kernels launched into the same stream execute in order, so a kernel launch boundary acts as a grid-wide synchronization point. Other methods are threadblock-level reduction, threadblock draining, & cooperative groups, which provide a set of primitives for thread cooperation & decomposition. If our goal is to sum the values in an array, we can represent that operation as a tree.
Looking at the tree, we can come up with a very simple method for the sub-problems that fit within a thread block, known as a parallel sweep reduction, for which an algorithm is given below with a thread diagram. The threads here achieve sequential addressing, which makes this algorithm bank-conflict free. Sequential addressing occurs when threads read & write shared memory at adjacent indices, one thread next to the other, which means warps & blocks can share retrieved sectors.
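A minimal sketch of the parallel sweep with sequential addressing, assuming a power-of-two block size (the kernel name & the BLOCK_SIZE constant are illustrative & must match the launch configuration):

```cuda
#define BLOCK_SIZE 256

__global__ void blockReduce(const float *in, float *out, int n) {
    __shared__ float sdata[BLOCK_SIZE];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (idx < n) ? in[idx] : 0.0f; // load one element per thread
    __syncthreads();

    // Each sweep halves the number of active threads; thread tid always
    // touches sdata[tid] & sdata[tid + s], so addressing stays sequential.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = sdata[0]; // one partial sum per block
}
```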
Sequential addressing gets us thread-block-level sums, but we’d like to be able to scale to some arbitrarily large dataset efficiently while maintaining coalesced loads & stores & efficient shared memory usage. The way to generalize to arbitrarily large data is a grid-stride loop, in which each thread walks the global data array in steps of the total grid size, so a fixed-size grid covers the entire input, as in the sketch below.
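A sketch combining the grid-stride loop with the sweep above, assuming *out is zero-initialized before launch; each thread first accumulates privately across its strided positions, then the block sweeps & one thread merges the block's partial sum (using an atomic here is one of several ways to combine partials):

```cuda
#define BLOCK_SIZE 256

__global__ void gridStrideReduce(const float *in, float *out, int n) {
    __shared__ float sdata[BLOCK_SIZE];
    float val = 0.0f;

    // Stride by the total number of threads in the grid; consecutive threads
    // touch consecutive addresses, so loads stay coalesced.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        val += in[i];

    sdata[threadIdx.x] = val;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0) atomicAdd(out, sdata[0]); // merge block partials
}
```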
Parallel reductions use shared memory for inter-thread communication within a block. This means that each thread has to read & write shared memory, but what if threads could communicate directly?
It turns out there is a mechanism for this called the warp shuffle, in which threads in a warp can communicate directly with each other. A few intrinsics exist to facilitate the warp shuffle: __shfl_sync(), __shfl_up_sync(), __shfl_down_sync(), & __shfl_xor_sync().
Source & destination threads both need to participate; a mask parameter in these operations selects which warp lanes do. The warp shuffle can reduce or eliminate shared memory usage (which can be an occupancy bottleneck), reduce the number of instructions used, & reduce the level of explicit synchronization. There are other tools available, e.g. a value can be broadcast to all threads in a warp via a single instruction, warp-level prefix sums & sorting are possible, & atomic aggregation is possible, i.e. instead of 32 atomic operations, we can reduce across a warp & then have one thread perform the atomic operation, as shown below.
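A sketch of a warp-level sum via __shfl_down_sync with atomic aggregation, assuming *sum is zero-initialized before launch (the kernel name is illustrative; 0xffffffff is the mask selecting every lane in the warp):

```cuda
__global__ void warpShuffleReduce(const float *in, float *sum, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (idx < n) ? in[idx] : 0.0f;

    // Each step folds the upper half of the active lanes onto the lower half,
    // with no shared memory & no explicit barrier.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);

    if ((threadIdx.x & 31) == 0) // lane 0: one atomic per warp instead of 32
        atomicAdd(sum, val);
}
```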
Stencil operations are common in scientific calculations & are also known as window operations. A 1D stencil has a radius on either side of its center element, which is not included in those radii. The stencil slides across the input, transforming or reducing it into the output.
Shared memory is physically included on the chip die & is faster than global memory, which is DRAM attached to the GPU. DRAM is accessed in memory segments, but shared memory is accessed at the byte level. Shared memory is a.k.a. on-chip memory & is a block-level resource. GPUs also have on-chip caches, which are a separate concept. One way to accelerate any computation that has data reuse, e.g. sliding a 1D stencil across some array, is to use memory that is at a higher level in the memory hierarchy, a.k.a. closer to the processor, which takes some number of redundant data accesses & moves them closer to the calculation. So, we can use shared memory, denoted by the `__shared__` attribute, as a block-level cache. When we load our block's input into shared memory, we need to include the left & right halos that the stencil needs for each output element. Then, each block needs blockDim.x + 2 * radius input elements loaded into shared memory from global memory. This concept is commonly implemented in what is called a stencil kernel, e.g. the following.
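A minimal sketch of such a stencil kernel, assuming the input is padded by RADIUS elements on each side so the halo loads stay in bounds (BLOCK_SIZE & RADIUS are illustrative compile-time constants matching the launch configuration):

```cuda
#define BLOCK_SIZE 256
#define RADIUS 3

__global__ void stencil1D(const int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS]; // tile plus both halos
    int gidx = blockIdx.x * blockDim.x + threadIdx.x;
    int lidx = threadIdx.x + RADIUS;

    temp[lidx] = in[gidx];        // each thread loads its own element
    if (threadIdx.x < RADIUS) {   // first RADIUS threads also load the halos
        temp[lidx - RADIUS]     = in[gidx - RADIUS];
        temp[lidx + BLOCK_SIZE] = in[gidx + BLOCK_SIZE];
    }
    __syncthreads();              // wait until the whole tile is populated

    int result = 0;               // each output sums 2 * RADIUS + 1 inputs
    for (int off = -RADIUS; off <= RADIUS; off++)
        result += temp[lidx + off];
    out[gidx] = result;
}
```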
You can see from the stencil kernel that there is no guarantee about the order or relative timing in which threads execute, so we have to ensure the shared memory tile has been fully populated before any thread reads from it. The call `__syncthreads()` is a block-level synchronization method known as a barrier: it forces all threads in the block to reach it before any may proceed. For this reason, `__syncthreads()` should not be called within conditional logic unless the condition evaluates identically across the entire block.