Maximizing Hardware Efficiency: How Speedata Solves the Firmware Challenge with a Promise-based Async Framework

Omri Mezrich
4 days ago

The golden rule of high-performance hardware is that the firmware’s #1 job is to keep the hardware accelerators busy. In the world of big data analytics, every single cycle that an engine sits idle is wasted throughput.

At Speedata, we developed the C200 Analytics Processing Unit (APU), a custom accelerator card designed from the ground up to accelerate big data analytics workloads by up to 100x. But building powerful silicon is only half the battle.

To actually achieve these massive speedups, the software needs to stay out of the way and seamlessly feed data to the hardware. Here’s how we achieve this at Speedata.

The Challenge of Keeping Silicon Busy

In modern accelerators and ASICs, keeping silicon busy is often the main firmware challenge. The hardware might theoretically deliver 10× more throughput than CPUs, but if scheduling and concurrency aren’t handled correctly, you might only see a fraction of that performance.

Modern silicon is built for parallel execution. Therefore, to maximize throughput, firmware must drive multiple hardware engines concurrently. The core responsibilities include:

Orchestrating hardware engines: Managing tasks like DMA (Direct Memory Access), decoding, filtering, and hashing.
Overlapping operations: Executing tasks in a continuous pipeline, such as firing the DMA to fetch the next unit of work from memory while the hardware engine is still busy processing the current unit.
Handling multiple execution contexts: Serving independent command streams simultaneously on a single thread.
Managing asynchronous events: Responding to host commands that arrive unpredictably via interrupts.

To keep area and power overheads to a minimum, accelerator control planes are typically built under tight resource constraints. At Speedata, our APU uses over 70 control-plane cores to orchestrate these hardware datapath engines. For these control planes, we rely on the open-source RISC-V VeeR EH1 core (rv32imc). While this RISC-V architecture is incredibly efficient and allows for fast vectored interrupts and smaller code sizes, it leaves us with a strict memory limit: the firmware must fit inside just 64KB of instruction memory (ICCM) and 64KB of data memory (DCCM).

Why Traditional Approaches Fail

When tasked with managing concurrent operations on bare-metal hardware, engineers typically turn to two standard solutions: RTOS-based threading or manual state machines. In theory, both can coordinate multiple tasks. In practice, each introduces its own set of problems when the system must drive hardware engines at very high throughput.

RTOS-based designs rely on threads, scheduling, and context switching to interleave work. This abstraction is convenient, but it comes with overhead. Every context switch requires saving and restoring registers, stack state, and scheduler metadata. On systems where operations are extremely short (such as hardware queues or DMA transactions) this overhead can become significant relative to the work being performed. RTOS environments can also introduce complex synchronization issues: mutex contention, priority inversion, and race conditions between threads interacting with the same hardware resources. These bugs are difficult to reproduce and debug in real-time firmware.
State machines avoid threading overhead but often become unmanageable as complexity grows. Each operation has its own states: waiting for data, processing, writing results, and handling errors. When multiple operations run concurrently, engineers must track every possible combination of these states. This creates a state explosion, where the number of possible transitions grows exponentially. The result is sprawling switch statements, tangled logic, and code that becomes extremely difficult to reason about, extend, or debug. What starts as a clean control flow often turns into fragile logic that few engineers feel comfortable modifying.
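To put a number on the state explosion, here is a small worked example. It assumes each operation has just the four states named above and that the operations are independent; the names OpState and JointStates are illustrative only and do not come from any Speedata code.

```cpp
#include <cassert>
#include <cstdint>

// The four per-operation states named in the text (hypothetical enum).
enum class OpState : std::uint8_t {
    WaitingForData,
    Processing,
    WritingResults,
    HandlingError,
    Count  // number of real states above
};

// Number of joint states a hand-written state machine has to consider
// when `concurrent_ops` independent operations are in flight at once.
constexpr std::uint64_t JointStates(unsigned concurrent_ops) {
    assert(concurrent_ops <= 31);  // 4^32 would overflow 64 bits
    std::uint64_t total = 1;
    for (unsigned i = 0; i < concurrent_ops; ++i) {
        total *= static_cast<std::uint64_t>(OpState::Count);
    }
    return total;
}

static_assert(JointStates(1) == 4);
static_assert(JointStates(2) == 16);
static_assert(JointStates(4) == 256);
```

Even with only four states per operation, four overlapping operations already give 256 combinations to reason about, before counting the transitions between them.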

The Speedata Solution: A Promise-Based Async Framework

Speedata built a custom promise-based asynchronous framework. The goal was simple: to allow firmware to orchestrate many hardware operations concurrently. The result is a lightweight runtime written in modern C++20, implemented in roughly 1,000 lines of core code, running on bare metal with no operating system.

Instead of blocking while waiting for hardware to finish an operation, the firmware models hardware interactions as asynchronous tasks. When firmware launches an operation (such as triggering a decompression engine or starting a memory transfer) it immediately receives a Promise object that represents the future completion of that task.

The firmware then continues executing other work, often launching additional hardware operations. Once the hardware completes its task, it triggers a fast interrupt, and the framework resolves the promise and schedules the next step in the computation.
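The launch, continue, resolve sequence described above can be sketched in a few lines. This is a minimal illustration, not Speedata's actual API: the Promise class shown here, FakeEngine, Start, FireInterrupt, and DemoPromise are all hypothetical names, and the "interrupt" is simulated by an ordinary function call. For brevity the continuation runs inline when the promise resolves, whereas the framework described in this post schedules it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical one-shot promise carrying a 32-bit result. The continuation
// is a function pointer plus a context pointer, so no heap is required.
class Promise {
public:
    using Continuation = void (*)(void* ctx, std::uint32_t value);

    // Register the code to run once the operation completes.
    void Then(Continuation fn, void* ctx) {
        assert(fn_ == nullptr);              // one continuation per promise here
        fn_ = fn;
        ctx_ = ctx;
        if (resolved_) fn_(ctx_, value_);    // already complete: run right away
    }

    // Called from the completion path (here, the simulated interrupt).
    void Resolve(std::uint32_t value) {
        assert(!resolved_);
        resolved_ = true;
        value_ = value;
        if (fn_ != nullptr) fn_(ctx_, value_);
    }

private:
    Continuation fn_ = nullptr;
    void* ctx_ = nullptr;
    std::uint32_t value_ = 0;
    bool resolved_ = false;
};

// Simulated engine: Start() returns immediately with a promise, and the
// completion "interrupt" arrives later via FireInterrupt().
struct FakeEngine {
    Promise done;
    std::uint32_t pending_bytes = 0;

    Promise& Start(std::uint32_t bytes) {
        pending_bytes = bytes;
        return done;
    }
    void FireInterrupt() { done.Resolve(pending_bytes); }
};

std::uint32_t DemoPromise() {
    FakeEngine engine;
    std::uint32_t observed = 0;
    engine.Start(4096).Then(
        [](void* ctx, std::uint32_t v) { *static_cast<std::uint32_t*>(ctx) = v; },
        &observed);
    assert(observed == 0);     // firmware is free to do other work here
    engine.FireInterrupt();    // hardware finishes; the continuation runs
    return observed;
}
```

The key property is that Start() never blocks: the caller holds a promise and moves on, and the result only becomes visible once the completion path resolves it.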

Figure: Speedata's async framework flow

This approach allows the firmware to keep hardware engines busy at all times. Rather than waiting for one stage of processing to finish before starting the next, multiple hardware engines can run in parallel while the software coordinates their results. The code still reads like sequential logic: a developer describes the intent of the operation chain, while the framework manages the asynchronous execution behind the scenes.

A Small Set of Primitives

The framework intentionally relies on a small set of building blocks to keep the system predictable and maintainable. At the lowest level are objects responsible for event handling: waiting for hardware signals and tracking which operations are waiting for them. Above that are abstractions representing asynchronous state and continuations, which hold the results of operations and determine what code should execute next. A lightweight scheduler then runs these continuations cooperatively.

Primitive	Role
Awaitable / Awaiter	Registers callbacks for hardware events and resumes tasks when events occur
Promise / Task	Represents the result of an asynchronous operation and enables chaining of continuations
WorkQueue / WorkItem	Cooperative run-to-completion scheduler
PoolAllocator	Fixed-size block memory pools (no heap allocation)
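As an illustration of the last row, a fixed-size block pool can be built from a statically sized array and an intrusive free list. The sketch below is one common way to do this and is not taken from Speedata's code; the template parameters and the Allocate, Free, and in_use members are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fixed-size block pool. All storage lives inside the object,
// so nothing is ever requested from a heap.
template <std::size_t BlockSize, std::size_t BlockCount>
class PoolAllocator {
    // A free block stores the link to the next free block in its own bytes.
    union Block {
        Block* next;
        alignas(std::max_align_t) unsigned char storage[BlockSize];
    };

public:
    PoolAllocator() {
        // Thread every block onto the free list.
        for (std::size_t i = 0; i < BlockCount; ++i) {
            blocks_[i].next = free_;
            free_ = &blocks_[i];
        }
    }

    // O(1): pop the head of the free list, or return nullptr when exhausted.
    void* Allocate() {
        if (free_ == nullptr) return nullptr;
        Block* b = free_;
        free_ = b->next;
        ++in_use_;
        return b;
    }

    // O(1): push the block back onto the free list.
    void Free(void* p) {
        assert(p != nullptr && in_use_ > 0);
        Block* b = static_cast<Block*>(p);
        b->next = free_;
        free_ = b;
        --in_use_;
    }

    std::size_t in_use() const { return in_use_; }

private:
    Block blocks_[BlockCount];
    Block* free_ = nullptr;
    std::size_t in_use_ = 0;
};

bool DemoPool() {
    PoolAllocator<32, 2> pool;
    void* a = pool.Allocate();
    void* b = pool.Allocate();
    void* c = pool.Allocate();   // third request exceeds the two-block pool
    bool exhausted = (a != nullptr && b != nullptr && a != b && c == nullptr);
    pool.Free(a);
    void* d = pool.Allocate();   // the freed block is handed out again
    return exhausted && d == a && pool.in_use() == 2;
}
```

Because both operations are constant time and the capacity is fixed at compile time, memory use is bounded and allocation latency is predictable, which matters when the whole data image has to fit in 64KB of DCCM.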

Because the scheduler is cooperative rather than preemptive, tasks run until they voluntarily yield when awaiting a hardware event. This eliminates context switch overhead while keeping the runtime deterministic, which is an important property for firmware controlling hardware pipelines.
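A cooperative run-to-completion scheduler can be as small as a ring buffer of function pointers. The sketch below shows the general idea under simplifying assumptions: the WorkItem layout, the Post and RunUntilIdle methods, and the Task and Step helpers are hypothetical, and a real firmware queue would also need to mask interrupts around Post when it is called from an ISR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical work item: a function pointer plus its context.
struct WorkItem {
    void (*run)(void* ctx);
    void* ctx;
};

// Fixed-capacity FIFO of work items. Each item runs to completion; there is
// no preemption, so no register or stack state is saved between items.
template <std::size_t Capacity>
class WorkQueue {
public:
    bool Post(WorkItem item) {
        if (count_ == Capacity) return false;   // queue full
        items_[(head_ + count_) % Capacity] = item;
        ++count_;
        return true;
    }

    // Run items until the queue drains. Items may post further items.
    std::size_t RunUntilIdle() {
        std::size_t executed = 0;
        while (count_ > 0) {
            WorkItem item = items_[head_];
            head_ = (head_ + 1) % Capacity;
            --count_;
            item.run(item.ctx);
            ++executed;
        }
        return executed;
    }

private:
    WorkItem items_[Capacity] = {};
    std::size_t head_ = 0;
    std::size_t count_ = 0;
};

// A task that does one step of work, then yields by re-queuing itself.
struct Task {
    WorkQueue<4>* queue;
    char* log;
    std::size_t* len;
    char name;
    int steps_left;
};

void Step(void* ctx) {
    Task* t = static_cast<Task*>(ctx);
    assert(t->steps_left > 0);
    t->log[(*t->len)++] = t->name;        // "work": record which task ran
    if (--t->steps_left > 0) {
        t->queue->Post({&Step, t});       // yield: schedule the next step
    }
}

// Runs two two-step tasks and returns how many work items were executed.
std::size_t DemoWorkQueue(char* log) {
    WorkQueue<4> queue;
    std::size_t len = 0;
    Task a{&queue, log, &len, 'A', 2};
    Task b{&queue, log, &len, 'B', 2};
    queue.Post({&Step, &a});
    queue.Post({&Step, &b});
    return queue.RunUntilIdle();
}
```

Running the demo interleaves the two tasks in the order A, B, A, B: each step finishes before the next begins, and the order depends only on the order in which items were posted.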

Why the Async Model Works Well for Hardware

The async model provides several key benefits for firmware that orchestrates hardware accelerators:

Expressive code: the logic describes intent and reads like sequential code, even though execution is asynchronous.

Better hardware utilization: firmware can keep multiple engines active simultaneously instead of waiting for each operation to finish.

No context switching overhead: cooperative scheduling avoids the cost and complexity of RTOS thread switching.

This makes the control software easier to reason about while still achieving the high concurrency required to drive the datapath efficiently.

Composability in Action

To illustrate how this solves our hardware challenge in practice, consider one of our core tasks: the Table Reader.

The Table Reader core is responsible for scanning compressed table data from the host memory by traversing a list structure of buffer descriptors. Specifically, it must fetch each buffer descriptor using a DMA transfer and then push that descriptor to the hardware engine to be decompressed and decoded.

Using our promise-based framework, we can easily build a "double-buffered pipeline" that maximizes throughput. In a single, highly readable C++ function, we use two key primitives to orchestrate this:

Join() to express parallelism. We instruct the firmware to simultaneously push the current data node to the hardware engine (PushToEngine) while concurrently firing off a DMA transfer to fetch the next node from memory (DmaFromHost).
.Then() to express dependency. We chain the next iteration of the loop so that the core only moves on after both the engine push and the memory fetch have successfully completed.

This demonstrates the power of composability, the ability to build complex, high-level asynchronous workflows entirely out of simple, lower-level async operations.
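The shape of such a double-buffered loop can be sketched as follows. This is a self-contained host-side simulation, not the actual Table Reader: only the names Join, Then, PushToEngine, and DmaFromHost come from the text, while their signatures, the Promise class, FakeHw, ReadTable, and DemoTableReader are assumptions. For brevity it uses std::function and std::shared_ptr, where bare-metal firmware would draw from fixed-size pools, and it runs continuations inline rather than through a scheduler.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal promise: shared state plus one continuation.
class Promise {
    struct State { bool done = false; std::function<void()> next; };
    std::shared_ptr<State> state_ = std::make_shared<State>();
public:
    void Resolve() const {
        assert(!state_->done);
        state_->done = true;
        if (state_->next) {
            auto fn = std::move(state_->next);
            state_->next = nullptr;
            fn();
        }
    }
    void Then(std::function<void()> fn) const {
        if (state_->done) fn(); else state_->next = std::move(fn);
    }
};

// Resolves only after both inputs have resolved (expresses parallelism).
Promise Join(const Promise& a, const Promise& b) {
    Promise both;
    auto remaining = std::make_shared<int>(2);
    auto arrive = [both, remaining] { if (--*remaining == 0) both.Resolve(); };
    a.Then(arrive);
    b.Then(arrive);
    return both;
}

// Simulated hardware: operations complete only when FireInterrupts() runs.
struct FakeHw {
    std::vector<Promise> in_flight;  // launched but not yet completed
    std::string trace;               // order in which operations were launched

    Promise Launch() { Promise p; in_flight.push_back(p); return p; }
    Promise DmaFromHost(int node)  { trace += "D" + std::to_string(node); return Launch(); }
    Promise PushToEngine(int node) { trace += "P" + std::to_string(node); return Launch(); }

    void FireInterrupts() {
        while (!in_flight.empty()) {
            std::vector<Promise> batch;
            batch.swap(in_flight);
            for (const Promise& p : batch) p.Resolve();
        }
    }
};

// Push node `current` while fetching `current + 1`; continue once both finish.
void ReadTable(FakeHw& hw, int current, int last) {
    if (current == last) { hw.PushToEngine(current); return; }  // nothing left to fetch
    Promise push = hw.PushToEngine(current);
    Promise fetch = hw.DmaFromHost(current + 1);
    Join(push, fetch).Then([&hw, current, last] { ReadTable(hw, current + 1, last); });
}

std::string DemoTableReader() {
    FakeHw hw;
    hw.DmaFromHost(0).Then([&hw] { ReadTable(hw, 0, 2); });  // prime the pipeline
    hw.FireInterrupts();
    return hw.trace;
}
```

For three nodes the launch trace is D0, P0, D1, P1, D2, P2: after the initial fetch, every push is issued together with the fetch of the following node, and the next iteration starts only once both have completed.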

Instead of getting lost in a maze of state machine logic, the code reads sequentially and is easy to maintain. Most importantly, each iteration perfectly overlaps the DMA fetch with the hardware engine push. This guarantees that both the memory bus and the hardware engine are kept constantly busy, ensuring we never waste a single cycle of throughput.

Lessons Learned

Building a modern async framework on bare metal produced a few useful lessons.

What worked well

C++20 on bare-metal RISC-V. Modern language features and toolchain support made it practical to use contemporary programming patterns even in firmware.
Clang/LLVM 18. Link-time optimization (LTO) and identical code folding (ICF) helped reclaim code size lost to template bloat and produced smaller binaries than GCC.
VeeR PIC + fast ISRs. Low interrupt overhead allowed very fine-grained asynchronous operations.
Promise chains. Developers found them intuitive, and the resulting code reads almost like synchronous logic.

Where we had to be careful

Template proliferation. Static type safety is valuable, but it can increase code size if left unchecked.
Debugging. Promise chains can be harder to follow in traditional debuggers compared to linear code paths.
Profiling. Firmware environments still need better performance event support for analyzing async workloads.

Speedata's async framework reduces context switch overhead, fits within a tiny memory footprint, and, most importantly, keeps our hardware accelerators from wasting cycles.
