CUDA Software Model
Grids, Blocks, Warps & Threads
The hardware is abstracted as a Grid of Thread Blocks; blocks, and the threads within them, are indexed from 0.
Blocks map to SMPs (streaming multiprocessors):
- More than one block may reside on an SMP; each block is split into warps by the hardware.
- No guarantee of order of block execution
- No communication or synchronisation between blocks
Threads map to CUDA cores (each thread executes the kernel)
- executed in fixed groups of 32, called warps (see the indexing sketch below)
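A minimal sketch of how a thread locates itself in this hierarchy (the kernel name, out, and n are illustrative):

__global__ void whereAmI(int *out, int n) {
    // blockIdx = which block in the grid; threadIdx = which thread in the block
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = threadIdx.x / 32;   // warps are fixed groups of 32 threads
    int lane    = threadIdx.x % 32;   // position within the warp
    if (global_id < n)                // guard: the grid may overshoot n
        out[global_id] = warp_id * 32 + lane;
}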
Device & Host
Kernel calls on Host:
- Host CUDA kernel launch syntax:
myKernel<<<blocksPerGrid, threadsPerBlock>>>(arguments);
- Non-blocking (asynchronous) unless the host calls
cudaDeviceSynchronize();
Kernel functions on Device:
- must be declared __global__
- may only call __device__ functions or __device__ __host__ functions
- data accessed must reside in GPU memory (see the combined sketch below)
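A minimal end-to-end sketch tying these rules together (all names are illustrative): a __global__ kernel launched from the host calls a __device__ helper and works only on device memory.

__device__ float square(float x) {        // callable from device code only
    return x * x;
}

__global__ void myKernel(float *d_out, int n) {   // entry point: __global__
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = square((float)i);      // d_out lives in GPU memory
}

int main(void) {
    int n = 256;
    float *d_out;
    cudaMalloc((void **)&d_out, n * sizeof(float));
    myKernel<<<n / 64, 64>>>(d_out, n);   // launch returns immediately
    cudaDeviceSynchronize();              // block until the kernel finishes
    cudaFree(d_out);
    return 0;
}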
CUDA memory
Memory access by abstraction:
- Thread: registers, local memory, and DRAM via the L2 cache and its block's caches:
  - Shared Memory cache -> Global Memory
  - Constant cache -> Constant Memory
  - Read-only cache -> Read-only / Texture Memory
- Block -> local caches (Shared Memory, constant cache & L1/read-only cache)
- Grid -> main DRAM (Global Memory, Constant Memory & Read-only / Texture Memory)
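A minimal sketch mapping declarations onto these spaces (all names are illustrative; assumes a single block of 256 threads):

__device__   float g_data[256];    // global memory: DRAM, reached via L2
__constant__ float c_coef[16];     // constant memory: read via the constant cache

__global__ void demo(float *out) { // out: cudaMalloc'd global memory
    int tid = threadIdx.x;         // scalar locals live in registers
    __shared__ float s_buf[256];   // shared memory: on-chip, one copy per block
    // large per-thread arrays would spill to (DRAM-backed) local memory
    s_buf[tid] = g_data[tid] * c_coef[tid % 16];
    __syncthreads();               // shared memory is coherent within the block only
    out[tid] = s_buf[tid];
}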
Global Memory & Memory Copying
Dynamic global memory:
#include <stdlib.h>
#define N 1024   // illustrative size
int main(void) {
    float *a, *d_a;
    // allocate on the CPU (host)
    a = (float *)malloc(N * sizeof(float));
    // allocate on the GPU (device)
    cudaMalloc((void **)&d_a, N * sizeof(float));
    // copying: cudaMemcpy(dest, src, size, direction)
    cudaMemcpy(d_a, a, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(a, d_a, N * sizeof(float), cudaMemcpyDeviceToHost);
    // free on CPU & GPU
    free(a);
    cudaFree(d_a);
    return 0;
}
Static global memory:
__device__ float d_a[N];
float a[N];   // host-side array
int main(void) {
    // copying to/from the __device__ symbol
    cudaMemcpyToSymbol(d_a, a, N * sizeof(float));    // host to device
    cudaMemcpyFromSymbol(a, d_a, N * sizeof(float));  // device to host
    return 0;
}
Constant Memory
Constant Memory is set at runtime and read through the per-SM constant cache. When used correctly, it generates only 1/16 of the traffic of global loads:
- small amounts of read-only data
- values are broadcast to the threads in a half-warp (a group of 16 threads)
- very fast on a cache hit; very slow on a miss
Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store; thus, the more requests that can be served from the cache, the faster the system performs.
__constant__ int my_const[16];

__global__ void add() {
    int i = blockIdx.x;
    // every thread in a block reads the same element, so the value
    // is broadcast from the constant cache
    int value = my_const[i % 16];
}

int main(void) {
    int h_vals[16] = {0};          // host data (illustrative)
    // constant memory is set at runtime, before the launch
    cudaMemcpyToSymbol(my_const, h_vals, sizeof(h_vals));
    add<<<16, 16>>>();             // blocksPerGrid = 16 here
    cudaDeviceSynchronize();
    return 0;
}
Texture and Read-only Memory
The two were unified from the Kepler architecture onwards. There are two methods for utilising read-only / texture memory:
- Texture Memory Binding
- Read-only Memory, accessed through the read-only data cache (e.g. const ... __restrict__ pointer qualifiers or the __ldg() intrinsic; see the sketch below)
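A minimal sketch of the second method, assuming a Kepler-or-later GPU (kernel and parameter names are illustrative): qualifying a pointer as const ... __restrict__ lets the compiler route its loads through the read-only cache; the __ldg() intrinsic forces this explicitly.

__global__ void scaleReadOnly(float *out, const float * __restrict__ in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];     // or: out[i] = 2.0f * __ldg(&in[i]);
}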
Memory bandwidth is the rate at which data can be read from or stored into a semiconductor memory by a processor. Memory bandwidth is usually expressed in units of bytes/second, though this can vary for systems with natural data sizes that are not a multiple of the commonly used 8-bit bytes.
Shared Memory
Shared Memory is only accessible from within device functions; it is declared with the __shared__ qualifier and shared by all threads of a block, as in the sketch below.
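A minimal sketch, using a per-block reduction as the example (names are illustrative; assumes blocks of 256 threads): __shared__ declares the per-block storage inside the kernel, and __syncthreads() orders accesses to it.

__global__ void blockSum(const float *in, float *out) {
    __shared__ float buf[256];          // visible to all threads of this block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                    // wait until every thread has written

    // tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];       // one partial sum per block
}

Launched as blockSum<<<numBlocks, 256>>>(d_in, d_out), each block writes one partial sum that the host (or a second kernel) can combine.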