Spark Basics - Submitting Spark Jobs#
Spark Job Submission Process#
Terminology#
- Application: A Spark application written by the user, consisting of the code for a Driver function plus the Executor code that runs on multiple nodes distributed across the cluster.
- Driver: Runs the main() function of the Application and creates the SparkContext, which prepares the runtime environment for the Spark application. The SparkContext communicates with the ClusterManager to request resources, allocate tasks, and monitor execution; after the Executors finish running, the Driver shuts the SparkContext down. The SparkContext is typically taken to represent the Driver (see the sketch after this list).
- Worker: Any node in the cluster that can run Application code, analogous to a NodeManager node in YARN. In Standalone mode these are the Worker nodes configured through the slaves file; in Spark on YARN mode they are the NodeManager nodes.
- Executor: A process launched for the Application on a Worker node; it executes Tasks and stores data in memory or on disk. Each Application has its own independent set of Executors.
- Cluster Manager: The external service that allocates resources in the cluster. Currently there are:
  - Standalone: Spark's native resource management, with resource allocation handled by the Master.
  - Hadoop YARN: Resource allocation handled by the ResourceManager in YARN.
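To tie the terms together, here is a minimal sketch (the object name, input path, and hard-coded master are illustrative only, not from the original text): the program as a whole is the Application, its main() runs in the Driver, the SparkContext it creates registers with the Cluster Manager selected by the master URL, and the RDD operations are split into Tasks that run inside Executors on the Workers.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The user program below is the "Application"; its main() is executed by the Driver.
object WordCount {
  def main(args: Array[String]): Unit = {
    // The master URL decides which cluster manager the SparkContext registers with:
    // local[*] (local mode), spark://host:7077 (standalone), or yarn.
    val conf = new SparkConf()
      .setAppName("word-count")
      .setMaster("local[*]") // illustrative; in practice usually supplied by spark-submit

    val sc = new SparkContext(conf) // Driver side: requests resources and schedules tasks

    // The transformations below are broken into stages/tasks that run in the Executors.
    val counts = sc
      .textFile("hdfs:///tmp/input.txt") // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)

    sc.stop() // the Driver shuts down the SparkContext when the job is done
  }
}
```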
Execution Process#
Two core components: the DAGScheduler and the TaskScheduler.
For details, see: Spark Advanced - Spark Scheduling System
Three Job Submission Modes in Spark#
Local Mode#
- In local mode, there is no concept of master + worker.
- Local mode essentially starts a single local process and simulates the execution of a Spark job inside it: one or more executor threads within that process carry out the work, including job scheduling and task allocation (see the sketch below).
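A rough illustration, assuming a trivial computation: local mode is chosen purely through the master string, and the number in local[N] is the number of simulated executor threads inside the single JVM.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local mode: the driver and the simulated "executor" threads all live in one JVM.
// local[4] runs tasks on 4 threads; local[*] would use all available cores.
val conf = new SparkConf().setAppName("local-mode-demo").setMaster("local[4]")
val sc = new SparkContext(conf)

// Job scheduling and task allocation all happen inside this single process.
val sum = sc.parallelize(1 to 1000000).map(_.toLong).reduce(_ + _)
println(sum)

sc.stop()
```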
Standalone Submission Mode#
When submitting in standalone mode, the master must be set to spark://master_ip:port, for example spark://192.168.75.101:7077, as in the sketch below.
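The same setting expressed in application code, rather than as spark-submit's --master argument, might look like the following (the executor memory value is only an illustrative assumption):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Standalone mode: point the application at the standalone Master's spark:// URL.
// 192.168.75.101:7077 is the example address from the text; substitute your own Master.
val conf = new SparkConf()
  .setAppName("standalone-demo")
  .setMaster("spark://192.168.75.101:7077")
  .set("spark.executor.memory", "1g") // resources the Master will grant per Executor

val sc = new SparkContext(conf)
```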
Standalone Client Mode#
Process Description
- After the client starts, it directly runs the user program and launches the Driver-side components: the DAGScheduler, BlockManagerMaster, and so on.
- The client's Driver registers with the Master.
- The Master then instructs the Workers to start Executors. Each Worker creates an ExecutorRunner thread, which starts the ExecutorBackend process.
- Once an ExecutorBackend starts, it registers with the Driver's SchedulerBackend. The Driver's DAGScheduler parses the job and generates the corresponding stages; the TaskScheduler assigns each stage's tasks to Executors for execution.
- The job ends once all stages are completed.
Standalone Cluster Mode#
- The client submits the job to the Master.
- The Master selects a Worker node on which to start the Driver, i.e., the SchedulerBackend. That Worker creates a DriverRunner thread, which starts the SchedulerBackend process.
- The Master instructs the remaining Workers to start Executors, i.e., ExecutorBackends. Each Worker creates an ExecutorRunner thread, which starts the ExecutorBackend process.
- Once an ExecutorBackend starts, it registers with the Driver's SchedulerBackend. The SchedulerBackend process contains the DAGScheduler, which generates an execution plan from the user program and drives its execution. Each stage's tasks are held in the TaskScheduler; when an ExecutorBackend reports to the SchedulerBackend, tasks from the TaskScheduler are dispatched to that ExecutorBackend for execution.
Differences Between Standalone Client and Standalone Cluster#
- In client mode, the Driver starts on the machine where the spark-submit script is run.
- In cluster mode, the Driver is started on a Worker process assigned by the Master (the sketch below shows how the deploy mode is selected at submission time).
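One way to see where this choice is made is Spark's programmatic launcher; the sketch below (the jar path and main class are placeholders) is roughly equivalent to running spark-submit with --master and --deploy-mode.

```scala
import org.apache.spark.launcher.SparkLauncher

// Programmatic equivalent of `spark-submit --master spark://... --deploy-mode ...`.
val submit = new SparkLauncher()
  .setAppResource("/path/to/my-app.jar")    // placeholder application jar
  .setMainClass("com.example.MyApp")        // placeholder main class
  .setMaster("spark://192.168.75.101:7077")
  // "client": the Driver starts in this JVM, on the submitting machine.
  // "cluster": the Master picks a Worker and starts the Driver there via DriverRunner.
  .setDeployMode("cluster")
  .setConf(SparkLauncher.EXECUTOR_MEMORY, "2g")

val process = submit.launch() // spawns the underlying spark-submit process
process.waitFor()
```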
Yarn Submission Mode#
Yarn-Client Mode#
- The Spark YARN Client asks YARN's ResourceManager to start an ApplicationMaster.
- After receiving the request, the ResourceManager selects a NodeManager in the cluster and allocates the first Container for the application, requiring the ApplicationMaster to be started in this Container. Unlike YARN-Cluster mode, this ApplicationMaster does not run the SparkContext; it only communicates with the SparkContext for resource allocation.
- Once the SparkContext in the Client has been initialized, it establishes communication with the ApplicationMaster, registers with the ResourceManager, and requests resources (Containers) based on the task information.
- After the ApplicationMaster obtains resources (Containers), it communicates with the corresponding NodeManagers, asking them to start a CoarseGrainedExecutorBackend in each obtained Container. After starting, each CoarseGrainedExecutorBackend registers with the SparkContext in the Client and requests Tasks.
- The SparkContext in the Client assigns Tasks to the CoarseGrainedExecutorBackends for execution; they run the Tasks and report execution status and progress back to the Driver.
- After the application completes, the SparkContext in the Client asks the ResourceManager to deregister it and shuts itself down.
YARN-Cluster Mode#
In YARN-Cluster mode, when a user submits an application to YARN, the application runs in two stages:
- The first stage is to start the Spark Driver as an ApplicationMaster in the YARN cluster.
- The second stage is for the ApplicationMaster to create the application, request resources from the ResourceManager, and start Executors to run Tasks while monitoring the entire execution process until completion.
Differences Between YARN-Client and YARN-Cluster#
- In YARN-Cluster mode, the Driver runs in the AM (ApplicationMaster), which is responsible for requesting resources from YARN and supervising the job's execution status. Once the job is submitted, the Client can be closed and the job will continue running on YARN, which makes YARN-Cluster mode unsuitable for interactive jobs.
- In YARN-Client mode, the ApplicationMaster only requests Executors from YARN, and the Client communicates with the requested Containers to schedule their work, so the Client cannot go away. The Driver runs on the client machine that submitted the Spark job, which gives real-time access to detailed log output; this is helpful for tracking down and troubleshooting errors and is typically used for testing. See the submission sketch below.
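The same launcher can illustrate the practical difference between the two YARN modes (again, the jar path and main class are placeholders): with deploy mode "cluster" the Driver lives in the ApplicationMaster, so the submitting JVM only monitors state and may exit, whereas "client" mode would start the Driver in this JVM, which then has to stay alive until the job finishes.

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

val handle = new SparkLauncher()
  .setAppResource("/path/to/my-app.jar") // placeholder application jar
  .setMainClass("com.example.MyApp")     // placeholder main class
  .setMaster("yarn")
  .setDeployMode("cluster")              // Driver runs inside the ApplicationMaster on YARN
  .startApplication(new SparkAppHandle.Listener {
    override def stateChanged(h: SparkAppHandle): Unit =
      println(s"state=${h.getState}, appId=${h.getAppId}")
    override def infoChanged(h: SparkAppHandle): Unit = ()
  })

// In cluster mode this client only tracks the application's state; the job keeps
// running on YARN even if the client disconnects. Switching to .setDeployMode("client")
// would start the Driver in this JVM, so the client could not be closed until the job ends.
```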