Programming a GPU with CUDA

Posted by Hao Xu on April 9, 2019

CUDA Software Model

Grids, Blocks, Warps & Threads

The hardware is abstracted as a Grid of Thread Blocks, which are indexed from 0.

Blocks map to SMs (Streaming Multiprocessors):

  • More than one block may be resident on an SM at a time; each block is split into warps by the hardware.
  • There is no guarantee of the order of block execution.
  • There is no communication or synchronisation between blocks.

A kernel's threads map to CUDA cores:

  • executed in groups of 32 threads, called warps (see the indexing sketch below)
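
As a quick illustration of the indexing (the kernel name and N are hypothetical), each thread typically combines its block index, the block size, and its thread index into a global index:

#define N 1024

// each thread computes its global index from blockIdx, blockDim and
// threadIdx, all of which are indexed from 0
__global__ void myKernel(float *d_data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)               // guard the last, possibly partial, block
        d_data[i] *= 2.0f;
}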

Device & Host

Kernel calls on Host:

  • Host CUDA kernel launch syntax:

myKernel<<<blocksPerGrid, threadsPerBlock>>>(arguments);

  • Launches are non-blocking on the host unless followed by

cudaDeviceSynchronize();

Kernel functions on Device:

  • must be declared __global__
  • may only call __device__ functions or __device__ __host__ functions
  • all data accessed must reside in GPU memory (see the sketch after this list)
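
A minimal sketch putting these rules together (the kernel and helper names are hypothetical): a __global__ kernel that calls a __device__ helper, touches only device memory, and is launched from the host:

#define N 256

// hypothetical helper: callable only from device code
__device__ float square(float x) {
    return x * x;
}

// kernel: must be __global__, may call the __device__ function above,
// and only accesses data that lives in GPU memory (d_a)
__global__ void squareAll(float *d_a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_a[i] = square(d_a[i]);
}

int main(void) {
    float *d_a;
    cudaMalloc((void **)&d_a, N * sizeof(float));

    squareAll<<<N / 32, 32>>>(d_a);  // launch is non-blocking on the host...
    cudaDeviceSynchronize();         // ...so wait for the kernel to finish here

    cudaFree(d_a);
    return 0;
}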

CUDA memory

Memory access by abstraction level (a sketch in code follows this list):

  • Thread -> its own Registers and Local Memory, plus DRAM via the L2 cache and its block's caches:
      • Shared Memory / L1 cache -> Global Memory
      • Constant Cache -> Constant Memory
      • Read-only Cache -> Read-only / Texture Memory
  • Block -> local caches (Shared Memory, Constant Cache & L1/Read-only Cache)
  • Grid -> main DRAM memory (Global Memory, Constant Memory & Read-only / Texture Memory)
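
A sketch of how these spaces appear in code (all names are illustrative):

#define N 1024

__device__   float d_global[N];  // Global Memory (DRAM), visible to the whole grid
__constant__ float d_const[N];   // Constant Memory (DRAM), read via the Constant Cache

__global__ void kernel(const float *d_in) {
    __shared__ float s_tile[256];  // Shared Memory, visible to one block
    float r = d_in[threadIdx.x];   // r lives in a per-thread register
    s_tile[threadIdx.x] = r + d_global[threadIdx.x] + d_const[threadIdx.x];
}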

Global Memory & Memory Copying

Dynamic global memory:

#include <stdlib.h>

#define N 1024

int main(void) {
    float *a, *d_a;

    // allocate on the CPU (host)
    a = (float *)malloc(N * sizeof(float));

    // allocate on the GPU (device)
    cudaMalloc((void **)&d_a, N * sizeof(float));

    // copy host -> device, then device -> host
    cudaMemcpy(d_a, a, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(a, d_a, N * sizeof(float), cudaMemcpyDeviceToHost);

    // free on CPU & GPU
    free(a);
    cudaFree(d_a);

    return 0;
}

Static global memory:

#define N 1024

__device__ float d_a[N];

int main(void) {
    float a[N];

    // copy between the host array and the device symbol
    cudaMemcpyToSymbol(d_a, a, N * sizeof(float));    // CPU to GPU
    cudaMemcpyFromSymbol(a, d_a, N * sizeof(float));  // GPU to CPU

    return 0;
}

Constant Memory

Constant Memory is set at runtime and read through the per-SM Constant Cache. When used correctly, it generates only 1/16 of the traffic of global loads:

  1. Use it for small amounts of read-only data.

  2. Values are broadcast to all threads in a half-warp (a group of 16 threads).

  3. Access is very fast on a cache hit and very slow on a miss.

Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store; thus, the more requests that can be served from the cache, the faster the system performs.

__constant__ int my_const[16];

__global__ void add() {
    int i = blockIdx.x;
    int value = my_const[i % 16];  // every thread in the block reads the same
                                   // address, so the Constant Cache broadcasts it
}

int main(void) {
    int h_const[16] = {0};
    cudaMemcpyToSymbol(my_const, h_const, sizeof(h_const));  // set at runtime from the host

    int blocksPerGrid = 4;
    add<<<blocksPerGrid, 16>>>();

    return 0;
}

Texture and Read-only Memory

Texture Memory and Read-only Memory were unified from Kepler onwards. There are two methods for utilising Read-only / Texture Memory.
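
Assuming the two methods are the standard CUDA mechanisms (qualifying kernel pointers so the compiler uses the read-only data cache, and the texture object API), here is a minimal sketch of the first method; the kernel name is hypothetical:

#define N 1024

// marking the input pointer const ... __restrict__ lets the compiler
// route loads through the read-only (texture) cache on Kepler and later;
// __ldg() forces such a load explicitly
__global__ void scale(float *d_out, const float * __restrict__ d_in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    d_out[i] = 2.0f * __ldg(&d_in[i]);
}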

Memory bandwidth is the rate at which data can be read from or stored into a semiconductor memory by a processor. Memory bandwidth is usually expressed in units of bytes/second, though this can vary for systems with natural data sizes that are not a multiple of the commonly used 8-bit bytes.

Shared Memory

Shared Memory is only accessible from within device functions.
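
A minimal sketch (hypothetical kernel, assuming blockDim.x == BLOCK) of declaring Shared Memory inside a kernel and synchronising the block before reading what other threads wrote:

#define BLOCK 256

// each block stages its slice of d_in in Shared Memory, then every
// thread reads the value staged by its mirror-image neighbour
__global__ void reverseInBlock(float *d_out, const float *d_in) {
    __shared__ float tile[BLOCK];  // visible to this block only

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = d_in[i];

    __syncthreads();               // wait until the whole block has written

    d_out[i] = tile[BLOCK - 1 - threadIdx.x];
}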