
Compute Unified Device Architecture (CUDA)


Published on Aug 15, 2016

Abstract

CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing architecture developed by NVIDIA. CUDA is the computing engine in NVIDIA graphics processing units (GPUs) that is accessible to software developers through variants of industry-standard programming languages.

Programmers use 'C for CUDA' (C with NVIDIA extensions and certain restrictions), compiled through a PathScale Open64 C compiler, to code algorithms for execution on the GPU. The CUDA architecture shares a range of computational interfaces with two competitors: the Khronos Group's OpenCL (Open Computing Language) and Microsoft's DirectCompute. Third-party wrappers are also available for Python, Perl, Fortran, Java, Ruby, Lua, MATLAB and IDL.
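As a concrete illustration (not from the original article, and assuming the nvcc toolchain and a CUDA-capable GPU), the minimal sketch below shows what 'C for CUDA' looks like: standard C plus NVIDIA extensions such as the __global__ qualifier and the <<< >>> kernel-launch syntax.

```cuda
// Minimal "C for CUDA" sketch. Build with: nvcc hello.cu -o hello
// (file and kernel names are illustrative)
#include <stdio.h>

// __global__ is an NVIDIA extension: this function runs on the GPU
// but is called from CPU code.
__global__ void hello(void)
{
    // Device-side printf requires a Fermi-class (compute 2.0+) GPU.
    printf("Hello from GPU thread %d\n", threadIdx.x);
}

int main(void)
{
    hello<<<1, 4>>>();        // launch syntax: 1 block of 4 threads
    cudaDeviceSynchronize();  // wait for the GPU to finish
    return 0;
}
```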

CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs. Using CUDA, the latest NVIDIA GPUs become accessible for computation like CPUs. Unlike CPUs, however, GPUs have a parallel throughput architecture that emphasizes executing many concurrent threads slowly, rather than executing a single thread very fast. This approach to solving general-purpose problems on GPUs is known as GPGPU (general-purpose computing on GPUs).
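To make the throughput contrast concrete, here is a hedged sketch (illustrative code, not from the article) of the same element-wise addition written both ways: the CPU version is a single fast thread iterating over every element, while the CUDA version replaces the loop with one comparatively slow thread per element.

```cuda
// Sequential CPU version: one fast thread walks the whole array.
void add_cpu(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// GPGPU version: the loop disappears; thousands of concurrent
// threads each handle one element.
__global__ void add_gpu(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                // guard: the grid may be larger than n
        c[i] = a[i] + b[i];
}
```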

Introduction to CUDA

• CUDA is a platform for performing massively parallel computations on graphics accelerators

• CUDA was developed by NVIDIA Corporation

• It was first available with the G8x line of graphics cards

• Approximately 1 million CUDA-capable GPUs are shipped every week

• CUDA presents a unique opportunity to develop widely deployed parallel applications

Computing is evolving from "central processing" on the CPU to "co-processing" on the CPU and GPU. To enable this new computing paradigm, NVIDIA invented the CUDA parallel computing architecture that is now shipping in GeForce, ION, Quadro, and Tesla GPUs, representing a significant installed base for application developers.
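The division of labour in this co-processing model can be sketched as follows (an illustration assuming a single CUDA device; names and sizes are arbitrary): the CPU allocates device memory, copies data across, launches the kernel, and copies the result back.

```cuda
// Co-processing sketch: the CPU (host) orchestrates, the GPU (device)
// does the data-parallel work. Build with: nvcc square.cu -o square
#include <stdio.h>

__global__ void square(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= v[i];
}

int main(void)
{
    const int n = 1024;
    float h[n];                               // host-side data
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;                                 // device-side copy
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    square<<<(n + 255) / 256, 256>>>(d, n);   // GPU does the parallel part

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("h[3] = %f\n", h[3]);              // prints 9.000000
    return 0;
}
```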

The CUDA platform represents a shift from traditional clock-speed-intensive processing to distributed stream processing.

Implementations:

The runtime API has two levels. The C API (cuda_runtime_api.h) is a C-style interface that does not require compiling with nvcc.
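For instance, a plain C program can use the C API on its own; the sketch below (include paths and linker flags are illustrative and vary by installation) queries the device count and can be built with an ordinary C compiler linked against the CUDA runtime library.

```c
/* C-style runtime API sketch: no kernels, no <<< >>> syntax, so no nvcc.
 * Example build: gcc query.c -I/usr/local/cuda/include \
 *                -L/usr/local/cuda/lib64 -lcudart
 */
#include <stdio.h>
#include <cuda_runtime_api.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA devices found: %d\n", count);
    return 0;
}
```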

The C++ API (cuda_runtime.h) is a C++-style interface built on top of the C API. It wraps some of the C API routines, using overloading, references, and default arguments. These wrappers can be used from C++ code and can be compiled with any C++ compiler. The C++ API also has some CUDA-specific wrappers that wrap C API routines dealing with symbols, textures, and device functions. These wrappers require the use of nvcc because they depend on code being generated by the compiler. For example, the execution configuration syntax used to invoke kernels is only available in source code compiled with nvcc.
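The sketch below (illustrative names, assuming device memory is allocated by the caller) shows two of the nvcc-only features just mentioned: the templated cudaMemcpyToSymbol wrapper, which takes a device symbol directly rather than by name, and the execution configuration syntax.

```cuda
#include <cuda_runtime.h>

__constant__ float gain;              // device symbol (nvcc extension)

__global__ void apply_gain(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= gain;
}

void run(float *d_v, int n)
{
    float g = 2.0f;
    // C++ API wrapper: the symbol is passed directly, not as a string;
    // this relies on code generated by nvcc.
    cudaMemcpyToSymbol(gain, &g, sizeof(g));

    // Execution configuration: only valid in nvcc-compiled source.
    apply_gain<<<(n + 255) / 256, 256>>>(d_v, n);
}
```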

The next-generation CUDA architecture (codenamed "Fermi"), standard in NVIDIA's GeForce 400 series GPUs, is designed from the ground up to natively support more programming languages, such as C++. It has eight times the peak double-precision floating-point performance of NVIDIA's previous-generation Tesla GPUs. It also introduced several new features, listed below (the sketch after the list shows how to query a device for some of them):

• up to 512 CUDA cores and 3.0 billion transistors

• NVIDIA Parallel DataCache technology

• NVIDIA GigaThread engine

• ECC memory support

• Native support for Visual Studio
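As referenced above, a short device query is one way to check a card for these Fermi-era capabilities. The sketch below is illustrative; the fields come from the CUDA runtime's cudaDeviceProp structure.

```cuda
// Device-query sketch. Build with: nvcc fermi_check.cu -o fermi_check
#include <stdio.h>
#include <cuda_runtime_api.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);       // query device 0

    printf("Name:             %s\n", prop.name);
    printf("Compute version:  %d.%d\n", prop.major, prop.minor); // Fermi = 2.x
    printf("Multiprocessors:  %d\n", prop.multiProcessorCount);
    printf("ECC enabled:      %s\n", prop.ECCEnabled ? "yes" : "no");
    return 0;
}
```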