Nowadays, we have to deal with massive amounts of data, so the speed of calculations is critical. Traditional CPU technology is no longer capable of meeting computational demand. Parallel computing allows us to solve this problem by using multiple computing resources simultaneously. This technology is based on the capability of the GPU to divide computing tasks into hundreds of smaller ones and run them concurrently. However, to achieve a drastic increase in computing performance, developers have to rethink their algorithms to make them parallel.
In one of our previous posts, we provided tips on using CUDA, NVIDIA’s parallel computing architecture that harnesses the power of the GPU and CPU. CUDA allows developers to parallelize and accelerate computations across separate threads on the GPU simultaneously. The CUDA architecture is widely used for many purposes: linear algebra, signal processing, image and video processing, and more. How to optimize your code to reveal the full potential of CUDA is the question we’ll investigate.
In this article, we want to share our experience using CUDA for defining the fastest way to find the max element and its index with different algorithms, comparing to std::max_element in C++. This article might be useful for developers who are looking for the fastest way to perform computations.
To better understand the capabilities of CUDA for speeding up computations, we conducted tests to compare different ways of optimizing code to find the maximum absolute value of an element in a range and its index.
We ran our tests on both the CPU and GPU using different methods of code optimization. During the CPU tests, we applied three different policies for parallel and sequential execution. When we carried out our tests on the GPU, we used a non-optimized method of algorithm execution as well as five algorithms optimized for the parallel architecture.
We conducted our tests using the following hardware and software:
- Intel Core i5-4460, 16GB RAM
- NVIDIA GeForce GTX 1060 6GB
- СUDA 10
- Visual Studio 2017
Using other hardware or a different CUDA version wouldn’t have made a significant difference.
As input data, we took a range of 250,000 float elements. We ran the search 1,000 times.
You can find the full source code of our testing project here: cuda-reduce-max-with-index.
Let’s take a look at our tests first and then compare the results.
Tips for Working with CUDA Compute Platform
For conducting our tests, we took the max_element() algorithm from the C++ standard library, which finds the maximum value in a range and runs on the CPU. We tested three options for executing this algorithm according to the execution policy that appeared in C++17 for parallel and sequential algorithm execution:
- Sequenced_policy (execution::seq) makes the algorithm execute sequentially without parallelism in the calling thread.
- Parallel_policy (execution::par) allows the parallel algorithm functions to execute in either the invoking thread or a thread implicitly created by the library. Any such invocations are indeterminately sequenced with respect to each other.
- Parallel_unsequenced_policy makes the parallel algorithm execute in an unordered but parallel fashion in unspecified threads implicitly created by the library.
Here’s how we did this:
We also ran six algorithms searching for the maximum value in a range on the GPU in ways both non-optimized and optimized for parallel execution:
This algorithm works as a standard sequential search loop that looks for the maximum value and its index.
In this case, we optimized the loop for parallel execution in multiple threads. Each thread saves the maximum value and its index in local memory during loop execution.
Here’s what the optimized GPU thread looks like:
The absolute maximum is saved to the max_out parameter using the atomicMax operation, which is separately realized for floating-point numbers, as CUDA supports this operation only for whole numbers.
The index is defined by checking if the local maximum is absolute.
The following algorithm represents a more complex variation of the previous. In this case, the atomicMax operation deals with the memory – shared by all threads in a block and offering high access speed – to which the maximum values for all threads in that block are written.
This algorithm to search for the maximum value was modified for additional parallelization of CUDA thread blocks. The loop is built taking into account the parallel execution of blocks. While the shared memory is only available within one block, the atomicMax function will return not the maximum value and its index but a range of values, one from each block used. The absolute maximum value will be selected from the range received as an output.
This algorithm was performed using synchronized data exchange between threads within a warp (which contains 32 threads) to reduce the number of atomic operations in CUDA.
Similar to the previous case, the algorithm is executed in two stages:
- Search for the maximum value in each thread and write it to local memory
- Search for the maximum value and its index inside the warp using data exchange
Let’s look at the source code for the first stage:
Here, we used the __shfl_xor_sync operation for finding the maximum value within the warp and the __ballot_sync and __shfl_sync operations for finding the index.
This algorithm can be executed with or without shared memory. In the latter case, the results are written directly to the output parameters of the atomicMax function.
We calculated the speed of execution for each algorithm with the selected input data and compared the results in the following diagram:
The columns in the diagram show the algorithm execution time in milliseconds. The first three columns from the left show execution times on the CPU. The fourth represents the time for linear execution on the GPU, without optimization. As you can see, single-threaded execution on the GPU didn’t lead to any acceleration of computations.
In contrast, the following five columns represent execution times for optimized algorithms on the GPU. All these columns reveal that CUDA acceleration brings impressive results.
However, there was no significant difference in speed among the optimized algorithms when processing our selected input data. Even the method with warp acceleration didn’t bring the expected boost in speed.
In this article, we compared execution times for algorithms to find the maximum value of an element in a range and its index. Our purpose was to understand CUDA’s GPU acceleration capabilities. In our tests, we saw no significant difference in execution times for algorithms optimized for the GPU. So we can conclude that it would be better to choose an algorithm for executing linear tasks based on parameters other than computation speed: flexibility of configuring graphics systems, complexity of the input data structure, etc.
Working with CUDA can provide an enormous increase in computational speed of the GPU, but it also requires deep knowledge of C++ programming. Apriorit has a team of dedicated kernel developers and C++ programmers who have extensive experience working on complicated projects. If you have a challenging project in mind, feel free to contact us.