CUDA Software Model
Grids, Blocks, Warps & Threads
The hardware is abstracted as a Grid of Thread Blocks; blocks, and the threads within them, are indexed from 0.
Blocks map to SMPs (streaming multiprocessors):
- More than one block may reside on an SMP; each block is split into warps by the hardware.
- No guarantee of order of block execution
- No communication or synchronisation between blocks
Threads map to CUDA cores (each thread executes the kernel)
- executed in fixed groups of 32, called warps (see the indexing sketch below)
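A minimal sketch of how a thread locates itself in this hierarchy (the kernel name, out, and n are illustrative):

__global__ void whereAmI(int *out, int n) {
    // blockIdx = which block in the grid; threadIdx = which thread in the block
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = threadIdx.x / 32;   // warps are fixed groups of 32 threads
    int lane    = threadIdx.x % 32;   // position within the warp
    if (global_id < n)                // guard: the grid may overshoot n
        out[global_id] = warp_id * 32 + lane;
}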
Device & Host
Kernel calls on Host:
- Host CUDA kernel launch syntax:
myKernel<<<blocksPerGrid, threadsPerBlock>>>(arguments);
- Non-blocking (asynchronous) unless the host calls
cudaDeviceSynchronize();
Kernel functions on Device:
- must be declared __global__
- may only call __device__ functions or __device__ __host__ functions
- data accessed must reside in GPU memory (see the combined sketch below)
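A minimal end-to-end sketch tying these rules together (all names are illustrative): a __global__ kernel launched from the host calls a __device__ helper and works only on device memory.

__device__ float square(float x) {        // callable from device code only
    return x * x;
}

__global__ void myKernel(float *d_out, int n) {   // entry point: __global__
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = square((float)i);      // d_out lives in GPU memory
}

int main(void) {
    int n = 256;
    float *d_out;
    cudaMalloc((void **)&d_out, n * sizeof(float));
    myKernel<<<n / 64, 64>>>(d_out, n);   // launch returns immediately
    cudaDeviceSynchronize();              // block until the kernel finishes
    cudaFree(d_out);
    return 0;
}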
CUDA memory
Memory access by abstraction:
- Thread: registers, local memory, and DRAM via the L2 cache and its block's caches:
  - Shared Memory cache -> Global Memory
  - Constant cache -> Constant Memory
  - Read-only cache -> Read-only / Texture Memory
- Block -> local caches (Shared Memory, constant cache & L1/read-only cache)
- Grid -> main DRAM (Global Memory, Constant Memory & Read-only / Texture Memory)
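A minimal sketch mapping declarations onto these spaces (all names are illustrative; assumes a single block of 256 threads):

__device__   float g_data[256];    // global memory: DRAM, reached via L2
__constant__ float c_coef[16];     // constant memory: read via the constant cache

__global__ void demo(float *out) { // out: cudaMalloc'd global memory
    int tid = threadIdx.x;         // scalar locals live in registers
    __shared__ float s_buf[256];   // shared memory: on-chip, one copy per block
    // large per-thread arrays would spill to (DRAM-backed) local memory
    s_buf[tid] = g_data[tid] * c_coef[tid % 16];
    __syncthreads();               // shared memory is coherent within the block only
    out[tid] = s_buf[tid];
}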
Global Memory & Memory Copying
Dynamic global memory:
#include <stdlib.h>
#define N 1024   // illustrative size
int main(void) {
    float *a, *d_a;
    // allocate on the CPU (host)
    a = (float *)malloc(N * sizeof(float));
    // allocate on the GPU (device)
    cudaMalloc((void **)&d_a, N * sizeof(float));
    // copying: cudaMemcpy(dest, src, size, direction)
    cudaMemcpy(d_a, a, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(a, d_a, N * sizeof(float), cudaMemcpyDeviceToHost);
    // free on CPU & GPU
    free(a);
    cudaFree(d_a);
    return 0;
}
Static global memory:
__device__ float d_a[N];
float a[N];   // host-side array
int main(void) {
    // copying to/from the __device__ symbol
    cudaMemcpyToSymbol(d_a, a, N * sizeof(float));    // host to device
    cudaMemcpyFromSymbol(a, d_a, N * sizeof(float));  // device to host
    return 0;
}
Constant Memory
Constant Memory is set at runtime and read through the per-SM constant cache. When used correctly, it generates only 1/16 of the traffic of global loads:
- small amounts of read-only data
- values are broadcast to the threads in a half-warp (a group of 16 threads)
- very fast on a cache hit; very slow on a miss
Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store; thus, the more requests that can be served from the cache, the faster the system performs.
__constant__ int my_const[16];

__global__ void add() {
    int i = blockIdx.x;
    // every thread in a block reads the same element, so the value
    // is broadcast from the constant cache
    int value = my_const[i % 16];
}

int main(void) {
    int h_vals[16] = {0};          // host data (illustrative)
    // constant memory is set at runtime, before the launch
    cudaMemcpyToSymbol(my_const, h_vals, sizeof(h_vals));
    add<<<16, 16>>>();             // blocksPerGrid = 16 here
    cudaDeviceSynchronize();
    return 0;
}
Texture and Read-only Memory
The two were unified from the Kepler architecture onwards. There are two methods for utilising read-only / texture memory:
- Texture Memory Binding
- Read-only Memory, accessed through the read-only data cache (e.g. const ... __restrict__ pointer qualifiers or the __ldg() intrinsic; see the sketch below)
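A minimal sketch of the second method, assuming a Kepler-or-later GPU (kernel and parameter names are illustrative): qualifying a pointer as const ... __restrict__ lets the compiler route its loads through the read-only cache; the __ldg() intrinsic forces this explicitly.

__global__ void scaleReadOnly(float *out, const float * __restrict__ in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];     // or: out[i] = 2.0f * __ldg(&in[i]);
}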
Memory bandwidth is the rate at which data can be read from or stored into a semiconductor memory by a processor. Memory bandwidth is usually expressed in units of bytes/second, though this can vary for systems with natural data sizes that are not a multiple of the commonly used 8-bit bytes.
Shared Memory
Shared Memory is only accessible from within device functions; it is declared with the __shared__ qualifier and shared by all threads of a block, as in the sketch below.
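A minimal sketch, using a per-block reduction as the example (names are illustrative; assumes blocks of 256 threads): __shared__ declares the per-block storage inside the kernel, and __syncthreads() orders accesses to it.

__global__ void blockSum(const float *in, float *out) {
    __shared__ float buf[256];          // visible to all threads of this block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                    // wait until every thread has written

    // tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];       // one partial sum per block
}

Launched as blockSum<<<numBlocks, 256>>>(d_in, d_out), each block writes one partial sum that the host (or a second kernel) can combine.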