Tips for Working with CUDA Compute Platform

CUDA is a parallel computing platform developed by Nvidia for its graphics processing units. By using CUDA API, developers can retool GPUs to perform general purpose calculations. GPUs excel in algorithms that require processing large amount of data in parallel chunks. Thus, CUDA-based solutions are well suited for various applications regarding big data and research projects.

Not pretending to be a complete CUDA programming guide, this article deals with non-trivial aspects and possible pitfalls of working with CUDA for tasks that use the computational capabilities of Nvidia Nvidia GPUs. As a software R&D company, Apriorit has encountered and handled all of these CUDA programming issues in the projects we developed for our clients. Thus, we decided to share our practical experience working with CUDA and provide some CUDA programming examples with code explained.

CMake Features for CUDA

CMake has built-in support for CUDA. Typically, files containing CUDA source code have the *.cu extension. These files store kernel function instructions for compilation with NVCC. However, using this extension isn’t always convenient. CMake allows you to enable NVCC to support certain files with non-standard extensions by setting parameters of the source files:

C++

file(GLOB AlgorithmCpp algorithm/*.cpp)
  
set_source_files_properties(${AlgorithmCpp} PROPERTIES CUDA_SOURCE_PROPERTY_FORMAT OBJ)

Exceptions Handling and Behavior of the Driver When Errors Occur

Error handling in CUDA is implemented as function return codes of type cudaError_t. The return codes of NPP library functions work in a similar manner. This functionality along with the ability to get an error description string enabled implementation of the CudaException and NppException classes in the exceptxx library.

With CUDA function calls, everything is usually clear (see CUDA tutorial). The return code needs to be checked (for exceptxx, the CHECK_CUDA and CHECK_NPP macros). However, in the case of a custom global function there is no return code. For such cases, previous return codes can be retrieved with the cudaGetLastError() function (and the corresponding CHECK_LAST_CUDA macro in exceptxx). In this way, the successful execution of custom functions can be verified:

C++

cudaInitPrices<<<CUBLOCKS, CUTHREADS>>>(price,        
     srcSlice.mask, srcSlice.image.plain[0],        
     dstSlice.mask, dstSlice.image.plain[0],        
     srcOffset, dstOffset,        
     priceStartPoint, priceEndPoint); 
CHECK_LAST_CUDA();

Nevertheless, in some cases an error that happens during execution may cause not only an exception but also a driver crash accompanied by a flashing screen. Most frequently such errors occur when, in one way or another, custom CUDA functions exceed allocated memory. Such errors can take a while to reveal themselves since CUDA functions are executed asynchronously and incorrect addressing will emerge only during the next call to the CUDA function. In addition, the weakly ordered memory model used in CUDA can cause such situations.

Related services

Kernel and Driver Development

Multithreading

At the highest level of abstraction, a CUDA programmer works with a parallel system that has a SIMT (Single-Instruction, Multiple-Thread) architecture. CUDA parallel computing architecture allows one command to be executed by several more or less independent threads. A set of threads executed under one task is called a grid. Threads are grouped into warps (each warp comprises 32 threads), which, in turn, are grouped into larger entities called blocks.

All threads of a block are run on a single streaming multiprocessor, which is composed of scalar processors. However, several scalar processor blocks can share the resources of one streaming multiprocessor. CPU time is allocated in such a way that at any given moment all cores of a multiprocessor can process only one warp. Thus, threads belonging to one warp are synchronized at the CUDA hardware level. At the same time, threads belonging to different warps within one block can be out of sync.

In CUDA, threads are organized in blocks. A unique identifier is assigned to each thread that executes a kernel function. This identifier is accessible within the kernel via the built-in threadIdx variable. For the sake of convenience, the threadIdx variable is a three-component vector so threads can be identified using a one-, two-, or three-dimensional thread index.

Similarly to threads, blocks are identified with the help of the blockIdx variable. In the bounds of one block there’s a limit to the number of threads. For present GPUs, this limit is 1024.

Threads and blocks can be of type int (as in the example above) or dim3 (for the multidimensional threadIdx and blockIdx variables).

To call a kernel function, it’s necessary to define the number of blocks and threads that will be used for execution, set off by triple angle brackets (<<< >>>). Also, it’s possible to allocate an amount of shared memory and define the stream index for out-of-sync work.

C++

#define CUTHREADS 1024  //number of CUDA theards
#define CUBLOCKS 128    //number of CUDA blocks
  
  
cudaFindSkyAndGroundPoints<<<CUBLOCKS, CUTHREADS>>>(
    srcMask, dstMask, boundingBox,
    dstYawDiff,
    points);
CHECK_LAST_CUDA();

Automatic indexing is a convenient way to build cycles inside kernel functions and to address memory:

C++

static __global__ void cudaApplyWatermark(
    Cuda::BitmapView3<float> output,
    const Cuda::BitmapView2<Npp8u> watermark)
{
    for (int y = blockIdx.x; y < output.plain[0].size.height; y += gridDim.x)
    {
        for (int x = threadIdx.x; x < output.plain[0].size.width; x += blockDim.x)
        {
            float alpha = watermark.plain[1].at(x, y) / 255.0f;
            output.plain[0].at(x, y) = output.plain[0].at(x, y) * (1 - alpha) + watermark.plain[0].at(x, y) * alpha;
            output.plain[1].at(x, y) = (output.plain[1].at(x, y) - 128) * (1 - alpha) + 128.0;
            output.plain[2].at(x, y) = (output.plain[2].at(x, y) - 128) * (1 - alpha) + 128.0;
        }
    }
}

Synchronization Methods

Interaction between threads of the same block should occur via their fast shared memory (local variables of CUDA functions are created using the __shared__ variable declaration specifier).

Most often, synchronization issues emerge within one block when shared memory is accessed. To synchronize threads belonging to the same CUDA block, a barrier __syncthreads() synchronization function is used. This function keeps each thread waiting until other threads of the block reach the point where the __syncthreads() function has been executed and all shared and global memory access operations run by the threads of this block have been completed. It’s not recommended to place this function inside the if statement. Care must be taken to see that any thread can call this function unconditionally. This function is useful after accessing CUDA shared memory:

C++

//synchronization after writing to CUDA shared memory 
__shared__ int isFirstPointSet;
  
//---
  
bool isFirstPoint = false;
  
if (both && (none || srcOnly && dstOnly))
{
    isFirstPoint = !atomicCAS(&isFirstPointSet, false, true);
      
    if (isFirstPoint)
    {
        firstPoint = bothPoint;
    }
}
  
__syncthreads();
  
  
int squareDistance = -1;
  
if (both && (none || srcOnly && dstOnly) && !isFirstPoint)
{
    squareDistance = firstPoint.squareDistance(bothPoint);
}

By default, if the stream number is not defined explicitly or if the 0 index is specified during execution of the kernel function (the fourth parameter in the triple angle brackets), all functions will be executed consecutively. In CUDA, kernel functions can be executed out of sync in a certain cudaStream. If such functions are executed in cycles (each function in a separate cudaStream), but data has to be exchanged between them or between the stream and the host, then they should be synchronized through the cudaStreamSynchronize() function call. This function takes the stream index and waits until it ends.

There’s also an ultimate synchronization method that uses the cudaDeviceSynchronize() function. This function waits until execution of all previous commands in all threads of all streams has completed. If output results are required for the next kernel function to be executed, an additional synchronization through the cudaDeviceSynchronize() function is no longer necessary, and will only considerably slow down the processing time:

cudaFindSkyAndGroundPoints<<<CUBLOCKS, CUTHREADS>>>(
    srcMask, dstMask, boundingBox,
    dstYawDiff,
    points);
CHECK_LAST_CUDA();
  
cudaSelectSkyAndGroundPoints<<<1, CUBLOCKS * 2>>>(
    points,
    skyMask, skyYawDiff,
    outSkyPoint, outGroundPoint);
CHECK_LAST_CUDA();

Typically, the cudaDeviceSynchronize() function is used during debugging of kernel function code.

Atomic operations provide a convenient synchronization method. These operations are executed in isolation from other threads so operands remain unaltered during execution.

The following are the main atomic functions:

Arithmetic functions: atomicAdd, atomicSub, atomicExch, atomicMax, atomicInc, and atomicDec.

//atomic Add function
int atomicAdd(int* address, int val);

The atomicCAS function: Compare and Store.

int atomicCAS(int* address, int compare, int val);

Bit operations: atomicAnd, atomicOr, and atomicXor.

int atomicAnd(int* address, int val);

Atomic functions are designed to work only with integer arguments. However, it’s possible to work with floating point numbers. Prior to doing so, however, it’s necessary to study the behavior of such an implementation in detail to reveal potential issues.

//fatomicMin
__device__ float fatomicMin(float *addr, float value)
{
    float old = *addr, assumed;
  
    if(old <= value)
    {
        return old;
    }
    do
    {
        assumed = old;
        old = atomicCAS((unsigned int*)addr, __float_as_int(assumed), __float_as_int(value));
    }
    while(old!=assumed)
  
    return old;
}

Working with CUDA Memory

A GPU has several levels of memory, each of which has its own clear read/write characteristics. Every thread has local memory at its disposal that can’t be accessed by other threads. Data exchange between threads happens through the shared memory of a block these threads belong to. Moreover, all threads have access to the global memory. In this way, allocation is performed with the help of cudaMalloc() and other functions.

As mentioned above, memory addressing in CUDA should be treated with care, as in some cases possible issues will make themselves known at once (e.g. with a flashing screen), while in other cases they’ll need special conditions to reveal themselves. For example, addressed memory can have some junk stored in it that won’t cause an error at the moment but will evoke an error or uncontrolled behavior when attempting to work with this data in the future. When working with memory and error handling, it’s useful to keep in mind that CUDA has a weakly ordered memory model. This means that the order in which a thread writes to memory can differ from the order in which another thread reads from it. This can cause misinterpretations in certain situations. Therefore, it’s important to take into account the possibility of memory ordering introducing some errors. Here’s a basic example of how a first thread is executing the writeXY command and a second is executing readXY:

//weakly ordered model
__device__ volatile int X = 1, Y = 2;
__device__ void writeXY()
{
    X = 10;
    Y = 20;
}
__device__ void readXY()
{
    int A = X;
    int B = Y;
}

After the second thread completes execution, there’s a possibility that B = 20 and A = 1. However, when using a strongly ordered memory model, only the following options are possible:

A = 1, B = 2
A = 10, B = 2
A = 10, B = 20

Debugging Methods

Out-of-sync execution of a kernel function can complicate the debugging process – namely, finding places in the code where an error occurs. Using the CHECK_LAST_CUDA function after kernel function execution can be to no avail. However, this function will be executed during another call and verification. For this reason, while looking for errors after a potentially problematic function, the cudaDeviceSynchronize() function call should be added for the purpose of debugging. This function also returns cudaError_t, and the results of its work can be checked:

cudaCalcPrices<<<1, CUTHREADS>>>(price,
    priceStartPoint, priceEndPoint);
CHECK_LAST_CUDA();
CHECK_CUDA(cudaDeviceSynchronize());

Using the debug output, it’s possible to check the intermediate results of kernel function execution through the printf function supported in CUDA. After the function call, the output is displayed in chunks. Sometimes, if there is a large amount of output, some chunks of data can be lost or simply overwritten. Therefore the printf function should be called only in the null thread:

if (threadIdx.x == 0)
{
    printf("Debug info\n");
}

NVCC Compiler

Compilation Speed

While building a project, it’s important to keep in mind the speed of the compiler. NVCC doesn’t support compilation of precompiled headers, which leads to waiting for each file to be fully compiled. Therefore, NVCC has a pretty slow compilation speed.

IntelliSence and Visual Assist Errors

The IntelliSense code compilation tool that’s built into Microsoft Visual Studio and the third-party Visual Assist do not support the *.cu extension. In addition, there’s no automatic keyboard input supplementation or syntax highlighting. The CMake approach described above solves these issues.

A global kernel function call is performed by setting CUDA launch parameters encompassed by triple angle brackets (e.g., <<< a list of parameters >>>).

cudaCalcPrices<<<1, CUTHREADS>>>(price,
    priceStartPoint, priceEndPoint);
CHECK_LAST_CUDA();

In Visual Assist, code auto-formatting often corrupts the formatting in lines with calls that have triple angle brackets. For example:

cudaCalcPrices < <<1, CUTHREADS >> > (price,
    priceStartPoint, priceEndPoint);
CHECK_LAST_CUDA();

In addition, for the IntelliSense and Visual Assist parsers, triple angle brackets can break down syntax highlighting and often the ability to use shortcut key combinations (e.g. Ctrl+} to quickly move from the opening to closing brace or vice versa, or Alt+G to quickly move to the implementation in Visual Assist).

To fix this, all Visual Studio copies that are currently running should be restarted. However, in more complicated cases, the Visual Assist cache must be cleared.

To clear the Visual Assist cache, do the following:

Open the VASSISTX menu.
Click Visual Assist Options.
In the Visual Assist Options window, click Performance.
In the Performance tab, click the Clear button.
Restart Microsoft Visual Studio.

Conclusion

CUDA is an effective and powerful tool that allows developers to squeeze out more functionality from the beefed up GPUS, currently available on the market. However, it can be tricky to work with. Developer need a solid knowledge of C++ programming, a great understanding of GPU design and an experience with CUDA API, including the knowledge of all the weird quirks that arise due to an unconventional use of hardware.

We hope that the article above helped to clarify some of the basics of working with CUDA. And if you need a CUDA developers for your project, you can always send us your request for proposal – we will get back to you right away.