As demands on Apache Spark clusters in the data center have grown year over year, the total cost of analytics has accelerated even faster, while capacity and performance per dollar have remained relatively stagnant.
In terms of both capital expenditure and operating cost, database analytics has remained the most expensive workload in the data center for most organizations. Meanwhile, the data center footprint, energy consumption, environmental impact, maintenance requirements, and governance and compliance obligations of systems supporting analytics have driven the cost of hosting a Spark environment sharply upward.
For data engineers, this means contending with clusters that are already too slow for existing Spark workloads, let alone the next wave of data-hungry Spark projects. Meanwhile, repeatedly scaling Spark CPU cluster capacity to meet constantly growing compute and storage requirements is costly in terms of both hardware and data center footprint.
KEY OBJECTIVES FOR SPARK ACCELERATION
The primary benefits of Spark acceleration include performance improvement, time savings, increased processing capacity, scalability, and task offloading:
Performance Improvement
Spark acceleration can significantly increase the processing speed of long-running analytics workloads.
Faster processing times mean quicker completion of jobs and reduced resource consumption, leading to overall better resource utilization.
Time Savings
Faster processing times result in quicker insights, which can be crucial for businesses.
These time savings directly contribute to business goal-setting, prioritization, and decision-making, which increases top-line revenue potential. They also indirectly contribute to cost savings and improved efficiency by reducing the time wasted on redundant research or on additional people addressing the same task.
Increased Processing Power
Spark acceleration enables specific compute tasks to be performed with much higher parallel processing power compared to general-purpose CPUs.
This increased processing power yields more computational throughput per unit of hardware: it reduces the number of servers needed for a given workload and increases the size and volume of workloads the same infrastructure can handle without investment in additional resources.
This is known as price-performance improvement.
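To make the price-performance idea concrete, here is a minimal sketch with made-up numbers (these are not Speedata benchmarks): if an accelerated server delivers 10x the throughput of a CPU-only server at 2x the hourly cost, the price-performance gain is 5x.

```python
# Hypothetical illustration of price-performance; all numbers are made up.
cpu_jobs_per_hour = 100          # throughput of a CPU-only server
cpu_cost_per_hour = 10.0         # dollars per hour

accel_jobs_per_hour = 1000       # assumed 10x throughput with acceleration
accel_cost_per_hour = 20.0       # assumed 2x hourly cost

# Price-performance = work done per dollar spent.
cpu_price_perf = cpu_jobs_per_hour / cpu_cost_per_hour        # 10 jobs per dollar
accel_price_perf = accel_jobs_per_hour / accel_cost_per_hour  # 50 jobs per dollar

improvement = accel_price_perf / cpu_price_perf
print(improvement)  # 5.0, i.e. a 5x price-performance improvement
```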
Scalability
Spark acceleration can improve the scalability of your cluster without the need for a proportional increase in physical infrastructure.
This scalability allows you to handle growing workloads without a linear increase in infrastructure costs or continuously expanding the physical footprint of your data center.
Task Offloading
Offloading certain tasks to specialized processors reduces the burden on general-purpose CPUs. This can improve overall resource utilization, enable more compact server configurations, and accelerate all workloads, not only analytics on Spark.
As data center cost, space, and power consumption become pressing constraints, the price-performance limitations of traditional processors like CPUs for analytics typically lead to missed SLAs and to spiraling cloud costs incurred to offset availability constraints. They also stall data refreshes and prevent transformational analytics projects from getting kicked off at all.
Although some specialized hardware components like GPUs have demonstrated faster performance than CPUs for Spark workloads, the acceleration required for long-running analytics jobs like ETL and SQL is often dependent on software add-ons, which present new challenges for Spark acceleration.
THE SHORTCOMINGS OF SPARK ACCELERATION WITH SOFTWARE ALONE
Faced with the price-performance challenges of accelerating Spark by simply increasing compute alone, software acceleration has become a common, albeit inefficient, solution. Software accelerators are designed to improve the speed and scale of existing general-purpose processors for specific tasks.
The optimization does not happen at the hardware level; instead, it targets the throughput of a chosen pipeline at the algorithmic level or, more commonly, at the code level.
While software accelerators can be versatile, their use cases are fundamentally constrained by the underlying hardware and by those processors’ inability to handle characteristics essential to analytics on large, diverse data sets, such as branch divergence and massive parallelism.
So, while software accelerators can increase CPU performance for Spark analytics workloads, they also often require code changes and complex configuration settings.
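As an illustration of that configuration burden, a CPU-only Spark job is often hand-tuned through a grab bag of settings like the following. The property names are standard Spark configuration keys, but the values are arbitrary examples, not recommendations:

```python
# Typical knobs hand-tuned when squeezing performance out of CPU-based Spark.
# Property names are real Spark settings; values are illustrative only.
cpu_tuning = {
    "spark.executor.instances": "40",
    "spark.executor.cores": "5",
    "spark.executor.memory": "28g",
    "spark.memory.fraction": "0.7",
    "spark.sql.shuffle.partitions": "2000",
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.files.maxPartitionBytes": "256m",
}

# Render the settings as spark-submit flags.
for key, value in sorted(cpu_tuning.items()):
    print(f"--conf {key}={value}")
```

Each of these interacts with data volume, skew, and cluster shape, so the tuning has to be revisited as workloads evolve.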
Similarly, some engines have capitalized on the market rush for GPUs, which have become the most prominent commercial domain-specific processors, albeit designed for graphics and ML, not database analytics.
An approach that uses GPUs to accelerate Spark analytics jobs requires users to migrate their workloads into new frameworks, adapt the code accordingly, and then begin testing, debugging, and optimizing both functionality and performance. At best, this large investment results in a modest 3x acceleration, delivered by processors that were never designed for analytics workloads.
Examples of these new engines that take advantage of GPUs to accelerate Spark include Sqream, Kinetica, and HEAVY.AI.
A PROCESSOR DESIGNED FOR SPARK ACCELERATION
Ultimately, some of these Spark acceleration efforts are, at best, a stopgap solution targeting a niche set of users that are unusually constrained and for whom a performance improvement in the short term is worth any price or opportunity cost.
Alternatively, a long-term approach to Spark acceleration is architected starting with a processor designed specifically for database analytics. Speedata’s Analytics Processing Unit (APU) is custom-designed for Spark acceleration and the acceleration of other analytics engines.
The APU executes a broad range of tasks in parallel and can handle any data type and field length. Importantly, the APU automatically intercepts the work that was previously going to the CPU and reroutes it with zero framework and code changes to accelerate Spark performance with minimal overhead.
Using a plugin that’s easily installed in an existing cluster, the Speedata APU intercepts the logical plan of a workload many layers down from user code, so the existence of the APU remains transparent to Spark users. Once intercepted, the Speedata compilers generate APU-specific code for the query, which is then sent to all the worker nodes running in the cluster.
This approach allows users to run the same ETL jobs and SQL queries they were running before, unmodified, but at maximum Spark acceleration. This also unburdens data engineers from having to migrate their workloads and manage testing and debugging just to achieve modest speedups from processors not designed for analytics.
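Spark 3.x exposes a standard plugin entry point (the spark.plugins property). Assuming Speedata's plugin is wired in through that mechanism, enabling it might look like the sketch below; the class name and jar path are hypothetical placeholders, not published Speedata identifiers.

```python
from pyspark.sql import SparkSession

# Hypothetical: "com.speedata.ApuPlugin" and the jar path are placeholders.
spark = (
    SparkSession.builder
    .appName("existing-etl-job")
    # The only change to the job: register the accelerator plugin.
    .config("spark.plugins", "com.speedata.ApuPlugin")     # assumed class name
    .config("spark.jars", "/opt/speedata/apu-plugin.jar")  # assumed jar location
    .getOrCreate()
)

# ETL and SQL below run unmodified; plan interception happens beneath the
# user-facing API, so the APU stays transparent to Spark code.
```

This is a configuration fragment: it only runs on a cluster with the vendor jar installed, but it shows why no query or job code needs to change.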
Speedata’s APU uniquely decompresses, decodes, and processes millions (or even billions) of records from Parquet or ORC files per second, eliminating the I/O, compute, and capacity bottlenecks created by other chips that must write intermediate data back to memory. In head-to-head comparisons on TPC-DS and on production workloads, customer use cases from financial services, pharmaceuticals, adtech, and hyperscale cloud accelerated their Spark performance by an average of 50x with APUs compared to CPUs.
At scale, APUs are anticipated to deliver up to 100x price-performance improvement for critical long-running analytics workloads. And compared to GPUs, APUs are modeled to deliver a 91% reduction in capital costs, 94% space savings, and 86% energy savings.