Can GPUs Transform Data Analytics Like They Have AI?
Updated: Jan 15
GPUs are celebrated for their speed and efficiency in handling computationally intensive tasks in machine learning (ML), artificial intelligence (AI), and graphics applications. But can they bring the same level of performance to data analytics workloads?
Understanding the Differences: ML/AI Workloads vs. Data Analytics Workloads
To explore this question, we must first understand the fundamental differences between typical ML/AI workloads and data analytics workloads.
ML/AI workloads often involve large-scale matrix operations, such as those used in neural network training and inference. These tasks are highly parallelizable because they involve repetitive computations across large datasets with relatively simple control flow. For example, training a neural network involves performing the same operations (like matrix multiplications and activation functions) across many data points.
In contrast, data analytics workloads frequently involve complex queries that include conditional statements, joins, aggregations, and other operations that depend on the data values being processed. These tasks require frequent branching and complex control flows, making them less straightforward to parallelize. For instance, consider the following SQL query (a representative example, with A() through Z() as placeholders for per-branch expressions):
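```sql
SELECT Z(
    CASE
        WHEN Quantity > 30 THEN B(A(Quantity))
        ELSE Y(X(Quantity))
    END
) AS Result
FROM Orders;
```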

This query introduces a branch based on the Quantity value, resulting in different execution paths per record. Specifically, the CASE/WHEN clause translates to the per-record if-then-else statements sketched in the pseudo code below (assume a total of 8 threads, one record per thread):
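```
// Each of the 8 threads processes one record:
for t in 0..7 (in parallel):
    if Quantity[t] > 30:
        r = A(Quantity[t])   // path for large quantities
        r = B(r)
    else:
        r = X(Quantity[t])   // path for small quantities
        r = Y(r)
    Result[t] = Z(r)         // both paths finish with Z()
```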

This creates "branch divergence," where the execution path depends on the evaluated column's value. In AI workloads, such conditional processing is minimal, but in data analytics, it occurs frequently.
Branch Divergence in Data Analytics: The Challenge for GPUs
GPUs are designed around a Single Instruction, Multiple Threads (SIMT) architecture: groups of GPU threads execute the same instruction simultaneously, each on its own data element. This architecture is highly effective for tasks that apply uniform operations across large datasets, such as the matrix multiplications common in ML and graphics processing.
In the SIMT model, GPU threads are organized into groups called warps, each typically consisting of 32 threads. These threads execute instructions in lockstep, meaning that every thread within a warp must execute the same instruction at the same time. If different threads within a warp need to execute different instructions (as is the case with branch divergence), some threads must sit idle while others complete their path, leading to inefficiencies.
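To make that cost concrete, here is a small Python sketch (a toy model, not real GPU code; the path lengths and threshold are illustrative assumptions) that counts the instruction slots one warp spends when its threads split across a branch:

```python
# Toy model of SIMT lockstep execution for a single warp. Each branch side
# is modeled as a number of instruction slots; under divergence the warp
# issues both sides back to back, masking off the inactive threads.

def warp_cycles(quantities, then_len, else_len, threshold=30):
    takes_then = [q > threshold for q in quantities]
    cycles = 0
    if any(takes_then):        # at least one thread takes the THEN side
        cycles += then_len
    if not all(takes_then):    # at least one thread takes the ELSE side
        cycles += else_len
    return cycles

mixed = [10, 50, 20, 60, 5, 45, 15, 35]              # 8 threads, both paths taken
print(warp_cycles(mixed, then_len=4, else_len=4))    # 8: the paths serialize
print(warp_cycles([50] * 8, then_len=4, else_len=4)) # 4: uniform warp, no penalty
```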
Now let's see how a GPU runs the branchy part of the same query. In the diagram below, which uses a warp of 8 threads for simplicity, the lower 4 threads of the warp execute the A() function while the upper 4 sit idle; then the upper 4 threads execute the X() function while the lower 4 sit idle. This alternation continues through the whole flow.

Amdahl's Law caps the overall speedup at the reciprocal of the fraction of work that is not accelerated: if 20% of the code involves if-then-else conditions that the GPU cannot speed up, the overall speedup is limited to 5x, no matter how many threads are available. This is why NVIDIA often reports speedups of only 1.5x-3x. CPUs, by contrast, handle branch divergence efficiently, but they struggle with limited parallelism and the high overhead of fetching and decoding instructions anew for each data record.
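As a quick check of that limit, here is a minimal Python rendering of Amdahl's Law (the 20/80 split follows the example above; the 10x factor is an illustrative assumption):

```python
# Amdahl's Law: overall speedup when a fraction p of the work is
# accelerated by a factor s and the rest runs at its original speed.
def amdahl(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# If the branchy 20% of the work sees no GPU benefit, even an arbitrarily
# large speedup on the remaining 80% caps the overall gain at 5x.
print(amdahl(p=0.8, s=1e9))  # ~5.0
print(amdahl(p=0.8, s=10))   # ~3.6, in line with more modest real-world gains
```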
The Need for Specialized Architectures: Coarse-Grained Reconfigurable Architecture (CGRA)
To optimize performance for data analytics workloads, a new system architecture, Coarse-Grained Reconfigurable Architecture (CGRA), is emerging as a promising solution. A CGRA can be reconfigured for specific computational tasks, optimizing data flow through a configurable pipeline of compute elements. This architecture addresses the shortcomings of GPUs, particularly when it comes to dealing with branch divergence.
Branch Divergence and CGRA: A Detailed Look
As discussed earlier, data analytics workloads often involve complex queries with conditional statements that lead to branch divergence. Let's revisit the if-then-else at the heart of our SQL example (in pseudo code, with the same placeholder functions):
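```
if Quantity > 30:
    A(); B()     // path for large quantities
else:
    X(); Y()     // path for small quantities
Z()              // both paths converge here
```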

On a GPU, the threads handling A() and B() for quantities greater than 30 are active while the threads assigned to X() and Y() for quantities of 30 or less sit idle, and vice versa. This idle time is a significant source of inefficiency.
How CGRA Addresses Branch Divergence
CGRA mitigates these inefficiencies through a few key mechanisms:
Reconfigurable Compute Elements:
CGRA comprises an array of configurable compute elements that can be dynamically reprogrammed to handle different parts of a workload. Unlike GPUs, which execute a single instruction across multiple threads, CGRA elements can be configured to execute different instructions simultaneously. This flexibility allows CGRA to handle complex control flows and conditional statements more efficiently.
Optimized Data Flow:
In a CGRA, data flows through a configurable pipeline where each stage can be tailored to the specific computational task. This approach minimizes idle time by ensuring that compute elements are always active, processing different parts of the workload in parallel. For example, while one set of elements processes the A() function for high quantities, another set can simultaneously handle the X() function for lower quantities.
Parallel Pipeline Processing:
CGRA enables true parallelism at a coarse-grained level. Imagine a pipeline where the first record enters at the top and each subsequent record enters after a short delay. Each stage performs a specific computation, and once a computation is complete, the data moves to the next stage. This way, multiple records are processed simultaneously at different stages, ensuring that no compute element remains idle (see the sketch after this list).
Maximized Hardware Efficiency:
For our SQL example, one part of the pipeline can handle the logic for Quantity > 30, while another part simultaneously processes Quantity <= 30. This eliminates the idle time seen in GPU architectures, where some threads wait for others to finish divergent tasks.
Reduced Control Overhead:
CGRA reduces the overhead associated with branch divergence by distributing the control logic across the reconfigurable elements. Each element can independently execute its branch of the code, reducing the need for synchronization and coordination that causes delays in GPU-based systems.
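Here is a minimal Python sketch of the pipelining idea from the list above (the stage depth and one-record-per-cycle timing are illustrative assumptions, not Speedata specifics): each cycle a new record enters and every record in flight advances one stage, so once the pipeline fills, every stage stays busy.

```python
# Toy cycle-by-cycle model of a CGRA-style pipeline. Records shift through
# the stages one per cycle; divergent branch paths live side by side in
# hardware rather than being serialized, so stages stay occupied.

def simulate(records, depth=3):
    stages = [None] * depth   # e.g. stage 0: route, stage 1: A|X, stage 2: B|Y
    for cycle in range(len(records) + depth):
        incoming = records[cycle] if cycle < len(records) else None
        stages = [incoming] + stages[:-1]   # every record advances one stage
        print(f"cycle {cycle}: "
              + " | ".join("busy" if r is not None else "idle" for r in stages))

simulate([50, 10, 45, 20])  # fills over 2 cycles, runs fully busy, then drains
```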
Example: SQL Query Execution on CGRA
Consider the SQL query example again, but this time executed on a CGRA:
Initial Stage:
The first stage of the pipeline reads the Quantity value and determines the execution path (greater than 30, or 30 and below).
Parallel Execution:
If the quantity is greater than 30, the record is sent down one path of the pipeline where A() and B() functions are executed.
If the quantity is less than or equal to 30, it is sent down a different path where X() and Y() functions are executed.
Convergence:
Finally, all paths converge, and the Z() function constructs the final output. This ensures that all compute elements remain active and efficient throughout the process.
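Putting the three stages together, here is a minimal Python sketch that reuses the placeholder functions from the SQL example (the function bodies are arbitrary stand-ins):

```python
# The three stages described above: route on the predicate, run one of the
# two paths, then converge at Z(). On a real CGRA both paths are configured
# in hardware side by side; this sketch only models the routing decision.

def A(q): return q + 1          # stand-in: Quantity > 30 path, first step
def B(r): return r * 2          # stand-in: Quantity > 30 path, second step
def X(q): return q - 1          # stand-in: Quantity <= 30 path, first step
def Y(r): return r * 3          # stand-in: Quantity <= 30 path, second step
def Z(r): return {"Result": r}  # convergence stage builds the output row

def process(record):
    q = record["Quantity"]
    first, second = (A, B) if q > 30 else (X, Y)  # initial stage: pick a path
    return Z(second(first(q)))                    # all paths converge at Z

print(process({"Quantity": 50}))  # {'Result': 102}
print(process({"Quantity": 10}))  # {'Result': 27}
```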
By configuring the pipeline stages to handle different branches of the code simultaneously, CGRA eliminates the idle time that plagues GPUs in similar scenarios. This results in significant performance gains, especially for data analytics workloads with frequent branching.
These stages are illustrated in the following diagram, where the CGRA pipeline processes multiple records in parallel, minimizing idle time:

Speedata's Innovation: A New Processor Based on CGRA
Here at Speedata, we have harnessed the power of CGRA to develop a new processor specifically designed to achieve maximum efficiency for processing data analytics workloads. Our processor leverages the advanced features of CGRA to handle complex queries with minimal idle time, offering an order-of-magnitude performance advantage over GPUs in industry benchmarks like TPC-DS. To learn more about our innovative solutions and how they can transform your data processing capabilities, visit our website.
Conclusion
While GPUs excel in parallel computational tasks like ML and graphics, their performance in data analytics is hindered by branch divergence. Dedicated hardware designs like Speedata's Coarse-Grained Reconfigurable Architecture (CGRA) are necessary to fully optimize data analytics workloads, offering up to 15x performance improvements over GPUs.