Nvidia introduced CUDA, its general-purpose parallel computing architecture, in 2006. CUDA provides a new parallel programming model and instruction set architecture for Nvidia GPUs. It also comes with a software environment that allows programmers to use C as a high-level programming language and solve computationally demanding problems more efficiently.
A hierarchy of thread groups, shared memories, and barrier synchronization are the three key abstractions provided by CUDA, exposed to the developer as a minimal set of language extensions. They provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. These abstractions guide the developer to partition the problem into coarse subtasks that can be solved independently in parallel, and then into finer pieces that can be solved cooperatively in parallel.
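As a minimal sketch of how these three abstractions surface in CUDA C (the kernel and array names here are illustrative, not taken from the source), a block of threads can cooperate through shared memory, with a barrier ensuring all loads complete before any thread reads:

```cuda
// Illustrative kernel: each block cooperatively reverses one tile of the
// input array in shared memory. Assumes n is a multiple of the block size.
__global__ void reverseTile(const float *in, float *out, int n)
{
    // Shared memory: visible to all threads of this block.
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];

    // Barrier synchronization: wait until the whole tile is loaded
    // before any thread reads another thread's element.
    __syncthreads();

    if (i < n)
        out[i] = tile[blockDim.x - 1 - threadIdx.x];
}
```

Each block handles its tile independently (coarse subtasks), while the threads inside a block solve the tile cooperatively (fine-grained parallelism).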
The CUDA programming model scales transparently to large numbers of processor cores: a compiled CUDA program can run on any number of processors, and only the runtime system needs to know the physical processor count.
A CUDA program consists of serial host code and parallel parts called kernels. A kernel describes the operations executed by a single thread, and it is therefore executed by a set of threads in parallel. Each thread is given a unique thread ID, accessible within the kernel through the built-in variable threadIdx. CUDA arranges these threads into a hierarchy of blocks and grids: a block contains a set of threads, and a grid contains a set of independent thread blocks. Like threads, blocks inside a grid can be identified using the built-in block ID variable blockIdx.
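For illustration (a standard vector-addition sketch, not an example from the source), threadIdx and blockIdx combine into a unique global index per thread:

```cuda
// Illustrative kernel: each thread computes one element of c = a + b.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // blockIdx identifies the block within the grid, threadIdx the
    // thread within its block; blockDim is the threads per block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// Host-side launch: a grid of ceil(n/256) blocks, 256 threads each.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```

The guard `i < n` is needed because the grid may contain more threads than array elements when n is not a multiple of the block size.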
During execution, CUDA threads may access several memory spaces. Each thread has its own private local memory, and each thread block has a shared memory space visible to all threads of the block. There is also a global memory accessible to all threads. Two further read-only memory spaces are available: constant memory and texture memory. Constant memory is very limited in size and is cached; texture memory is large and also cached, and reading from it generally takes less time than reading from local or global memory. For a given application, the global, constant, and texture memory spaces are persistent across its kernel launches.
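As a sketch of how these spaces appear in CUDA C (the identifiers are illustrative; texture memory, which is bound through a separate API, is omitted), each space has its own declaration qualifier:

```cuda
// Constant memory: small, cached, read-only from kernels.
__constant__ float coeff[16];

__global__ void scale(float *data, int n)  // data resides in global memory
{
    // Shared memory: one copy per thread block.
    __shared__ float buffer[256];

    // Plain local variables live in per-thread registers or local memory.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        buffer[threadIdx.x] = data[i];
        data[i] = buffer[threadIdx.x] * coeff[0];
    }
}
```

Global memory pointers such as `data` are typically allocated on the host with cudaMalloc before the kernel launch.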
At the hardware level, the Nvidia GeForce 8800 GTX processor can be considered a collection of 16 multiprocessors, each with 8 scalar processor (SP) cores. Figure 1 shows a general view of the CUDA hardware interface. Each multiprocessor has its own shared memory, which is visible to all 8 processors inside it. The multiprocessors also have a set of 32-bit registers and texture and constant memory caches.
When managing hundreds of threads, the multiprocessor maps each thread to one scalar processor core. It employs an architecture called single-instruction multiple-thread (SIMT), which makes each scalar processor core a SIMT processor. Device memory is available to all the processors and allows communication between multiprocessors.
- NVIDIA CUDA Programming Guide Version 2.3.1, 2009.