Efficiently Processing Apache Parquet File Format
Most big data environments store datasets in the Parquet format, which organizes data by columns for efficient querying and compression. Columnar layout compresses well because values within a single column tend to be more similar to one another than values spread across a row. In addition, the Parquet file format is organized into row groups and column chunks. Row groups are horizontal subsets of the dataset that enable parallel processing and, through their metadata, efficient skipping of rows that cannot match a query. Column chunks contain the values of a single column within a row group, allowing a reader to fetch only the columns a query needs and improving query performance.
This approach minimizes I/O bottlenecks and speeds up reading data from memory by exploiting the similarity of values within each column.
Parquet file format (Link)
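For readers who want to see this structure concretely, the footer metadata of any Parquet file exposes its row groups and column chunks. Below is a minimal sketch using PyArrow; the file name part.parquet and the choice of library are our own assumptions, not part of the original example.

import pyarrow.parquet as pq

# Inspect the layout of a Parquet file from its footer metadata
# (assumes a local file named part.parquet).
meta = pq.ParquetFile("part.parquet").metadata

print(meta.num_row_groups)   # how many row groups the file contains
rg = meta.row_group(0)       # metadata of the first row group
print(rg.num_rows)           # number of rows stored in this group

col = rg.column(0)           # metadata of one column chunk
print(col.path_in_schema)    # which column this chunk holds
print(col.compression)       # codec used (e.g. SNAPPY, GZIP)
print(col.statistics)        # min/max statistics that allow skipping row groups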
Parquet breaks data into row groups and column chunks, and for each column it applies an encoding (such as RLE or dictionary encoding) followed by optional compression, for example gzip or Snappy. To illustrate this, we’ll use a simple query for the rest of the post: a portion of TPC-H Query 16 (database benchmark link). The query reads the “part” table and selects three columns: p_brand, p_type, and p_size. It filters out rows where p_brand equals 'Brand#45' or p_type starts with 'MEDIUM POLISHED', and keeps only a specific set of sizes.
SELECT p_brand, p_type, p_size
FROM part
WHERE
p_brand <> 'Brand#45'
AND p_type NOT LIKE 'MEDIUM POLISHED%'
AND p_size IN (49, 14, 23, 45, 19, 3, 36, 9)
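As a software point of reference, the same access pattern can be expressed with PyArrow's dataset API. This is a minimal sketch under our own assumptions (a local file named part.parquet and a recent PyArrow release that supports compute expressions in filters); it is not part of the original benchmark setup.

import pyarrow.dataset as ds
import pyarrow.compute as pc

# Open the "part" table stored as a Parquet file (hypothetical path).
part = ds.dataset("part.parquet", format="parquet")

# Only the three referenced column chunks are read from storage, and
# row-group statistics let the reader skip groups that cannot match.
table = part.to_table(
    columns=["p_brand", "p_type", "p_size"],
    filter=(
        (pc.field("p_brand") != "Brand#45")
        & ~pc.match_like(pc.field("p_type"), "MEDIUM POLISHED%")
        & pc.field("p_size").isin([49, 14, 23, 45, 19, 3, 36, 9])
    ),
)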
To understand the different phases of reading a Parquet file from memory, let’s follow the plain values of the “p_brand” column in the “part” table. In the example below we will use the following values:
[‘Brand#13’,‘Brand#32’,‘Brand#32’,‘Brand#32’,‘Brand#13’,‘Brand#13’,‘Brand#11’,‘Brand#11’,‘Brand#11’,‘Brand#11’,‘Brand#11’]
Before running a standard compression algorithm, Parquet would create a dictionary to represent this data more efficiently:
Dictionary: [‘Brand#13’, ‘Brand#32’, ‘Brand#11’]
indices: [0,1,1,1,0,0,2,2,2,2,2]
Parquet further compresses the index stream using run-length encoding. Each rle-run records a number of consecutive occurrences together with a dictionary index: rle-run = (occurrences, dictionary index).
[rle-run=(1,0), rle-run=(3,1), rle-run=(2,0), rle-run=(5,2)]
Finally, Parquet applies a compression algorithm such as gzip or Snappy to the encoded data to further reduce file size.
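The toy Python sketch below reproduces the dictionary and run-length encoding steps shown above. It illustrates the idea only and does not match Parquet's actual byte-level layout.

# Toy illustration of dictionary + run-length encoding for the
# p_brand values above (not Parquet's real on-disk representation).
values = ["Brand#13", "Brand#32", "Brand#32", "Brand#32", "Brand#13",
          "Brand#13", "Brand#11", "Brand#11", "Brand#11", "Brand#11", "Brand#11"]

# Dictionary encoding: unique values in order of first appearance, plus indices.
dictionary = list(dict.fromkeys(values))
indices = [dictionary.index(v) for v in values]

# Run-length encoding of the index stream: (run length, dictionary index).
runs = []
for idx in indices:
    if runs and runs[-1][1] == idx:
        runs[-1] = (runs[-1][0] + 1, idx)
    else:
        runs.append((1, idx))

print(dictionary)  # ['Brand#13', 'Brand#32', 'Brand#11']
print(runs)        # [(1, 0), (3, 1), (2, 0), (5, 2)]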
Parsing a Parquet file with a CPU
Below is a diagram describing how a CPU parses a Parquet file, one column at a time.

The major steps required to read a Parquet file from memory and analyze its data are:
Read the data from memory.
Decompress it (gzip, Snappy, etc.) and write the intermediate result to memory.
Read from memory, decode the encoding (RLE), and write back to memory.
Read from memory, perform the dictionary lookup, and write back to memory.
Read from memory and reconstruct rows (collect the columns) into a custom format.
Each of the steps described above writes its intermediate results to memory. Even though modern CPUs leverage parallelism through multiple cores, SIMD instructions, and cache memory, they still write intermediate data back to memory after every step of the flow. Additional constraints apply: SIMD instructions cannot be used when the data is not aligned and uniform, and big data does not fit into cache memory. It is also worth noting that the filtering part (the “WHERE” clause) is applied only after the Parquet data has been written to memory as rows.
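As a rough software analogy of this multi-pass flow, the sketch below materializes an intermediate buffer after every step. It reuses the toy dictionary/RLE encoding from earlier and substitutes zlib for gzip/Snappy purely to keep the example self-contained; it is not how a real Parquet reader is implemented.

import json
import zlib

# Build a toy "column chunk": encoded data, then general-purpose compression.
encoded_chunk = json.dumps({
    "dictionary": ["Brand#13", "Brand#32", "Brand#11"],
    "runs": [[1, 0], [3, 1], [2, 0], [5, 2]],
}).encode()
compressed_chunk = zlib.compress(encoded_chunk)

# Step 1: decompress -> intermediate buffer in memory.
encoded = json.loads(zlib.decompress(compressed_chunk))

# Step 2: RLE decode -> intermediate buffer of dictionary indices.
indices = [idx for length, idx in encoded["runs"] for _ in range(length)]

# Step 3: dictionary lookup -> intermediate buffer of materialized values.
p_brand = [encoded["dictionary"][i] for i in indices]

# Step 4: only after the column is fully materialized in memory can rows
# be reconstructed and the WHERE clause applied.
kept = [v for v in p_brand if v != "Brand#45"]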
Writing and reading to/from memory adds latency, causes congestion on the memory bus, disrupts parallelism, reduces the effectiveness of SIMD instructions, and consumes more system power.
How can these memory accesses be avoided for more efficient handling of Parquet files?
Just as video encoders and packet processors (NICs) have custom-built hardware accelerators, Parquet parsing is structured and repetitive enough to benefit from dedicated hardware. Such hardware can decompress and decode multiple columns in parallel while keeping intermediate data on-chip, significantly speeding up reading the file from memory. Dedicated hardware should exploit the features of the Parquet format:
Hardware-efficient (de)compression engines for column-oriented storage.
Skipping unnecessary data using the metadata of row groups and column chunks.
Processing different row groups in parallel.
Handling strings of highly variable lengths efficiently.
Reading Parquet with the Speedata APU (Analytics Processing Unit)
Streaming pipeline
At Speedata, we have built a hardware streaming pipeline that can decompress, decode, process the dictionary, and begin processing the data at the speed at which the data can be retrieved from memory, without leaving the Speedata APU to store intermediate results. Unlike a CPU, which writes intermediate data back to memory, the APU streams the outputs of each step directly into the inputs of the following step, using dedicated hardware optimized for the type of computation happening in each step.
Moreover, with multiple hardware units, the APU can process many columns in parallel. The APU contains around 200 parallel dedicated decompression hardware engines; the Parquet decompression throughput of a single APU is equivalent to roughly 150 fully occupied CPU cores dedicated solely to decompression.
The following charts show these differences graphically:

Pre-processing phase
In addition to streaming the data directly from one step to the next, the APU performs a “pre-processing” step before transposing columns into rows. During pre-processing, the APU performs additional operations, including value comparisons, whose results are used to filter records out in a subsequent step. These comparisons include regular-expression matching using hardware accelerators optimized for such operations. In this step, the APU can also perform data format conversions (such as timestamp/date conversions), extract and process nested lists in a column field, and perform other column-level operations, all at line rate.
Efficient filtering
Once the APU has exhausted all column-level computations, it combines information from multiple columns for initial filtering. For example, if we care about records where “p_brand = ’Brand#13’ OR (p_brand = ‘Brand#11’ AND p_size < 20)”, this AND/OR operation happens in the Filter step. Since the column-level pre-processors have already done the value comparisons, this step simply evaluates the precomputed metadata from multiple columns.
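As a rough software analogy (with hypothetical values and NumPy arrays standing in for the hardware's precomputed results), the filter stage only needs to combine per-column comparison bitmaps that already exist:

import numpy as np

# Per-column comparison results, as produced by the column-level pre-processors.
p_brand = np.array(["Brand#13", "Brand#11", "Brand#45", "Brand#11"])
p_size = np.array([10, 25, 5, 12])

is_brand13 = p_brand == "Brand#13"
is_brand11 = p_brand == "Brand#11"
is_small = p_size < 20

# Filter step: only AND/OR over the precomputed per-column bitmaps.
keep = is_brand13 | (is_brand11 & is_small)  # -> [True, False, False, True]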
Explode operation during pipeline without writing to memory
In some cases, a column may contain nested data in the form of structs or lists. The APU accelerates the EXPLODE (link) operator in this step, flattening the data into duplicate rows, each with a different value of the list or struct. This works even for multiple levels of nesting, such as a list of structs, all without memory accesses in the stream.
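For readers unfamiliar with EXPLODE, the small pandas sketch below shows the equivalent software operation on a hypothetical nested column; the data and library choice are ours.

import pandas as pd

# A hypothetical column containing a nested list per row.
df = pd.DataFrame({
    "p_partkey": [1, 2],
    "sizes": [[49, 14], [23]],
})

# EXPLODE flattens each list into duplicate rows, one row per element.
exploded = df.explode("sizes")
#    p_partkey sizes
# 0          1    49
# 0          1    14
# 1          2    23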
By taking care of these steps during the initial reading of column data, the APU frees up compute capacity for subsequent operations that can only happen at a row level, thereby speeding up the end-to-end query execution.
Summarizing the main advantages of the APU:
A streaming pipeline avoids intermediate memory accesses.
Parallel processing of many columns speeds up execution.
The pre-processing phase, string processing, and filtering discard records in the early stages.
An accelerated Explode operation runs inside the pipeline without writing to memory.
Conclusion
Parquet is a widely used columnar file format in big data environments, offering efficient querying and compression through its column-oriented layout and row groups. While modern CPUs leverage parallelism and caching, the intermediate writes to memory during processing still hinder performance. The Speedata APU addresses this challenge with a custom hardware pipeline that processes Parquet data directly on-chip without intermediate writes. The APU performs decompression, decoding, filtering, and pre-processing at line rate, using metadata to optimize row filtering and accelerating nested data handling. Streaming data through specialized hardware like Speedata’s APU significantly speeds up query execution and improves efficiency compared to traditional CPU-based approaches.