Can GPUs Transform Data Analytics Like They Have AI?
Updated: Aug 15
GPUs are celebrated for their speed and efficiency in handling computationally intensive tasks in machine learning (ML), artificial intelligence (AI), and graphics applications. But can they bring the same level of performance to data analytics workloads?
Understanding the Differences: ML/AI Workloads vs. Data Analytics Workloads
To explore this question, we must first understand the fundamental differences between typical ML/AI workloads and data analytics workloads.
ML/AI workloads often involve large-scale matrix operations, such as those used in neural network training and inference. These tasks are highly parallelizable because they involve repetitive computations across large datasets with relatively simple control flow. For example, training a neural network involves performing the same operations (like matrix multiplications and activation functions) across many data points.
In contrast, data analytics workloads frequently involve complex queries that include conditional statements, joins, aggregations, and other operations that depend on the data values being processed. These tasks require frequent branching and complex control flows, making them less straightforward to parallelize. For instance, consider the following SQL query:
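A query of the kind described might look like the sketch below. The table name, column names, and threshold are assumed for illustration; the essential element is a CASE expression that branches on the Quantity value of each row:

```sql
-- Illustrative query (table, columns, and threshold are assumed).
-- The CASE expression branches on the Quantity column row by row.
SELECT
    OrderID,
    CASE
        WHEN Quantity > 10 THEN Price * Quantity * 0.9   -- discounted path
        ELSE Price * Quantity                             -- standard path
    END AS TotalAmount
FROM Orders;
```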
This query introduces a code branch based on the Quantity value, resulting in different execution paths. Specifically, the WHEN clause of the CASE expression translates into if-then-else statements, as shown in the pseudocode below (for simplicity, assume a total of 8 threads, each handling one row):
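A C-like sketch of that translation follows; A() and X() are placeholder functions standing in for the WHEN and ELSE branches, not names taken from any actual query engine:

```c
/* C-like sketch of how the CASE expression maps to a per-row branch.
   Assume 8 threads, each assigned one row (tid = 0..7); A() and X()
   are placeholders for the discounted and standard pricing branches. */
float A(float price, float qty) { return price * qty * 0.9f; }
float X(float price, float qty) { return price * qty; }

void process_row(const float *price, const float *quantity,
                 float *result, int tid)   /* tid = thread id, 0..7 */
{
    if (quantity[tid] > 10.0f) {
        result[tid] = A(price[tid], quantity[tid]);  /* WHEN branch */
    } else {
        result[tid] = X(price[tid], quantity[tid]);  /* ELSE branch */
    }
}
```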
This creates "branch divergence," where the execution path depends on the evaluated column's value. In AI workloads, such conditional processing is minimal, but in data analytics, it occurs frequently.
Branch Divergence in Data Analytics: The Challenge for GPUs
GPUs are designed around a Single Instruction, Multiple Threads (SIMT) architecture. This means that groups of threads execute the same instruction simultaneously, each thread operating on its own data element. The architecture is highly effective for tasks that apply uniform operations across large datasets, such as the matrix multiplications common in ML and graphics processing.
In the SIMT model, GPU threads are organized into groups called warps, each typically consisting of 32 threads. These threads execute instructions in lockstep, meaning that every thread within a warp must execute the same instruction at the same time. If different threads within a warp need to execute different instructions (as happens with branch divergence), some threads must sit idle while others complete their path, leading to inefficiencies.
Now let's see how a GPU runs the branching code from the same query. In the diagram below, which uses a warp of 8 threads for simplicity, the lower 4 threads of the warp execute the A() function while the upper threads sit idle; then the upper 4 threads execute the X() function while the lower 4 sit idle. This serialization continues for the whole flow.
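As a concrete sketch, a minimal CUDA kernel with the same branch might look like the following (A() and X() remain illustrative placeholders). Whenever the Quantity values handled by one warp fall on both sides of the threshold, the warp executes the two paths one after the other, exactly as in the diagram:

```cuda
// Minimal CUDA sketch of the divergent branch (illustrative, not an
// actual query engine's generated kernel).
__device__ float A(float price, float qty) { return price * qty * 0.9f; }
__device__ float X(float price, float qty) { return price * qty; }

__global__ void total_amount(const float *price, const float *qty,
                             float *result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Threads in the same warp whose qty values fall on different sides
    // of the threshold take different paths, so the warp runs the A()
    // path and the X() path sequentially rather than in parallel.
    if (qty[i] > 10.0f) {
        result[i] = A(price[i], qty[i]);
    } else {
        result[i] = X(price[i], qty[i]);
    }
}
```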
According to Amdahl's Law, if 20% of the work is effectively serial, for example because if-then-else branches force divergent paths to execute one after another, the maximum speedup is capped at 1/0.2 = 5x, no matter how many threads are available. This helps explain why NVIDIA often reports speedups of only 1.5x-3x for such workloads. CPUs, by contrast, handle branch divergence efficiently, but they offer limited parallelism and pay a high overhead in repeatedly fetching and decoding instructions for each data record.
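For reference, a sketch of the Amdahl's Law arithmetic behind that 5x figure, with s denoting the fraction of work that is effectively serial and N the number of parallel threads:

```latex
% Maximum speedup under Amdahl's Law with serial fraction s and N threads.
S(N) = \frac{1}{\,s + \frac{1 - s}{N}\,}
\qquad\Longrightarrow\qquad
\lim_{N \to \infty} S(N) = \frac{1}{s} = \frac{1}{0.2} = 5
```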