
How to collect Apache Spark logs for Speedata Workload Analyzer?

Writer: Ofir Manor


The Speedata Workload Analyzer is a tool that lets you analyze Apache Spark event log files and generate a report of the Speedata APU's projected benefit for your workloads. You can quickly learn which queries will benefit the most, whether a faster network would have a big impact on your APU environment, or drill into a detailed per-stage analysis of the benefits and limits. Below is an example of Speedata Workload Analyzer results for a specific query execution.


Example of Speedata Workload Analyzer results

In this post, we will focus on the main prerequisites to using the Workload Analyzer - identifying good Spark job candidates for analysis and locating their event log files.


1. Find great Spark job candidates for Speedata acceleration

When evaluating the Speedata APU, you will likely want to identify jobs with the potential for significant acceleration. For that, we recommend looking for Spark jobs that are (a quick way to shortlist them is sketched after this list):

  • Long-running - Speedata is optimized for accelerating long-running, CPU-intensive queries. Try looking for jobs running minutes to hours, as those are likely to benefit the most.

  • Using Spark SQL or Dataframes - Speedata integrates with the Spark optimizer (“Catalyst”) to transparently accelerate queries. Jobs that are written using Spark’s low-level API (RDD API) are not eligible for acceleration, as they bypass the Spark optimizer.

  • Meaningful to the business - typically there are a few hourly or daily jobs that consume a significant portion of your Spark cluster resources, or whose long duration is problematic.
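
If you have a Spark History Server running, its REST API can help shortlist long-running candidates. Below is a minimal sketch that lists completed applications with their duration in minutes, longest first; it assumes the History Server is reachable at localhost:18080 and that jq is installed - adjust both for your environment.

# list completed applications and their duration in minutes, longest first
$ curl -s "http://localhost:18080/api/v1/applications?status=completed" \
    | jq -r '.[] | [.id, .name, ((.attempts[0].duration // 0) / 60000 | floor)] | @tsv' \
    | sort -t$'\t' -k3,3 -rn | head -20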


2. Make sure Spark Event Logging is enabled

The Speedata Workload Analyzer processes Spark event log files. The Spark driver writes these log files, so logging is controlled by the Spark configuration of your job.

Here is how to make sure Spark emits these logs and control their location and format:

  1. Understand where your Spark driver is running

    • If your Spark job is running in cluster mode, the Spark configuration will be located on each of your Spark nodes, or in the container image of your Spark pods.

    • If your Spark job is running in client mode, the Spark configuration will be located on the app server submitting the job (for example, Spark Connect or Apache Livy servers), outside the Spark cluster.

  2. Find your Spark configuration file 

    From one of the nodes you identified in the previous step, access the $SPARK_HOME/conf directory and look at the spark-defaults.conf file.

  3. Check/set the following parameters

    By default, the Spark driver does NOT emit the Spark event log files.

    To emit those files, set the following parameters in the spark-defaults.conf file (a minimal example follows the list):

    • spark.eventLog.enabled must be set to true for Spark to emit event log files.

    • spark.eventLog.dir points to the directory where event logs will be stored.
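
A quick way to check what is currently set is to grep the file for eventLog. As a minimal sketch, an enabling snippet in spark-defaults.conf might look like the following; the directory shown is only a placeholder - use a location your deployment can share with the Spark History Server (local path, HDFS, S3, etc.):

# $SPARK_HOME/conf/spark-defaults.conf
spark.eventLog.enabled   true
spark.eventLog.dir       hdfs:///spark-logs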


If Spark event logging was disabled, you should run the jobs again after enabling it so they emit this log.

Alternatively, you can control those parameters when submitting your job (overriding the defaults).
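
For example, here is a minimal sketch of overriding these settings for a single submission (the class name, jar, and log directory are placeholders):

$ spark-submit \
    --conf spark.eventLog.enabled=true \
    --conf spark.eventLog.dir=hdfs:///spark-logs \
    --class com.example.MyJob \
    my-job.jar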


NOTE - Spark rolling event logs

Spark rolling event logs is an optional configuration that switches to a new log file when the current one reaches a certain size. This is useful when you have a long-running daemon that serves queries indefinitely, such as Spark Connect or Apache Livy. This feature is controlled by setting spark.eventLog.rolling.enabled to true.

Note that with this enabled, each job will have a directory of event log files instead of a single log file, so you will need to pick the entire directory for analysis.
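
A minimal configuration sketch for this mode; spark.eventLog.rolling.maxFileSize (128m by default) controls how large each file may grow before Spark rolls over to a new one:

# $SPARK_HOME/conf/spark-defaults.conf
spark.eventLog.rolling.enabled       true
spark.eventLog.rolling.maxFileSize   128m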


Here is an example of such a log directory with four event files:

~/spark_logs$ ls -l eventlog_v2_local-1725531639815
total 70868
-rw-r--r-- 1 ofir ofir        0 Sep  5 15:24 appstatus_local-1725531639815
-rw-r--r-- 1 ofir ofir 10488183 Sep  5 15:24 events_1_local-1725531639815
-rw-r--r-- 1 ofir ofir 10488245 Sep  5 15:24 events_2_local-1725531639815
-rw-r--r-- 1 ofir ofir 10485118 Sep  5 15:24 events_3_local-1725531639815
-rw-r--r-- 1 ofir ofir 10488020 Sep  5 15:24 events_4_local-1725531639815

3. Collect the Spark event logs

Now that you have picked some candidate jobs for analysis and enabled Spark event logging for them, all that remains is to find those log files and copy them to your computer, or to wherever you will run the Speedata Workload Analyzer CLI. You can do this from the Spark UI or from the CLI.


Collecting from the Spark History Server

Assuming you have the Spark History Server up and running, its main page shows a list of all the past jobs that have been retained. You can sort or filter by app name, start time, duration, Spark user, etc., to find the specific job you want, and then download the relevant event log file directly from its UI:

Spark event log
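
If you prefer the command line, the History Server exposes the same download through its REST API. A minimal sketch, where the host and application ID are placeholders:

# download all event logs of one application as a zip file
$ curl -s -o eventLogs.zip \
    "http://localhost:18080/api/v1/applications/application_1719468522946_0008/logs"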

Find your logs using the CLI

You can navigate to the directory where your Spark event logs are stored (controlled by the spark.eventLog.dir parameter discussed above). From there you can copy all files from a specific date or time range, for example:
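
Here is a minimal sketch using GNU find to list event log files last modified on a given day (the date is a placeholder):

# list event log files last modified on a given day
~/spark_logs$ find . -maxdepth 1 -type f -newermt "2024-11-17" ! -newermt "2024-11-18"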

Alternatively, you can use grep or similar commands to find log files containing your app name. For example, the following looks for the string tpcds_39 in all log files whose names start with app, and then uses ls to see their last-modified timestamps:

~/spark_logs$ grep -li tpcds_39 app* | xargs ls -l
-rw-r--r-- 1 ofir ofir 4720484 Nov 17 16:18 application_1719468522946_0008_1
-rw-r--r-- 1 ofir ofir 4659916 Nov 17 16:18 application_1719468522946_0009_1
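
Once you have identified the files, copy them to the machine where you will run the Workload Analyzer. For example (the hostname and paths are placeholders; the second variant applies when spark.eventLog.dir points to HDFS):

# copy an event log file from a remote Spark node
$ scp spark-node:/opt/spark/spark-events/application_1719468522946_0008_1 .

# or pull it from HDFS if spark.eventLog.dir points there
$ hdfs dfs -get /spark-logs/application_1719468522946_0008_1 .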

That's it!

Now that you have collected the Spark event log files for your job, you can use the Workload Analyzer to get a prediction of the likely acceleration Speedata's APU will deliver for your queries.


