Distributed Resource Management and Scheduling#

Content adapted from:

  1. Geek Time Column: “Principles and Algorithm Analysis of Distributed Technology”
  2. Analysis of Google Cluster Management System Omega

Distributed Architecture#

Centralized Structure#

A centralized structure consists of one or more servers that together form a central server. All data in the system is stored on the central server, all business requests are first processed by it, and resource and task scheduling is coordinated by it alone.

  • Google Borg
  • Kubernetes
  • Mesos

Decentralized Structure#

Service execution and data storage are spread across different server clusters, which communicate and coordinate with each other through message passing. Compared with a centralized structure, a decentralized structure relieves the pressure on any single machine or cluster, eliminating single-point bottlenecks and single-point failures while improving the system's concurrency. It is therefore better suited to managing large-scale clusters.

  • Akka Cluster
  • Redis Cluster
  • Cassandra Cluster

Scheduling Architecture of Monolithic Scheduler#

A monolithic scheduler performs both resource scheduling and job management in a single process. A typical open-source representative is Hadoop's JobTracker.

Disadvantages
Poor scalability: first, the cluster size it can manage is limited; second, new scheduling strategies are hard to integrate into the existing code. For example, a scheduler that previously supported only MapReduce jobs may now need to support streaming jobs, and embedding a streaming-job scheduling strategy into the monolithic scheduler is difficult.

Optimization Plan
Place each scheduling strategy in a separate path (module), so that different kinds of jobs are scheduled by different strategies. This can significantly shorten job response time when the job volume and cluster size are relatively small, but since all strategies still live in one centralized component, the system's overall scalability does not improve.
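The per-strategy layout above might be sketched as follows. The job kinds, policy functions, and dispatch table are illustrative assumptions, not any real scheduler's API; the point is that every path still runs inside one process:

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    kind: str  # e.g. "mapreduce" or "streaming" (assumed job kinds)

def schedule_mapreduce(job):
    return f"{job.name}: placed by the MapReduce policy"

def schedule_streaming(job):
    return f"{job.name}: placed by the streaming policy"

# One path (module) per job type, but all paths still live in a single
# process, so the scheduler remains one centralized component.
POLICIES = {
    "mapreduce": schedule_mapreduce,
    "streaming": schedule_streaming,
}

def monolithic_schedule(job):
    return POLICIES[job.kind](job)

print(monolithic_schedule(Job("wordcount", "mapreduce")))
```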

Scheduling Architecture of Two-Tier Scheduler#

The two-tier scheduler retains a simplified centralized scheduler, but the scheduling strategies are delegated to various application schedulers. Typical representatives of this type of scheduler are Mesos and YARN.

The division of responsibilities in a two-tier scheduler is: the first-tier scheduler manages cluster resources and allocates them to frameworks, while each second-tier scheduler receives the resources allocated to it and matches its own tasks against them.
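A minimal sketch of this split, with made-up node capacities and a greedy first-fit match standing in for a real framework scheduler:

```python
cluster = {"node1": 4, "node2": 8}  # node -> free CPU cores (assumed)

def make_offer():
    """Tier 1 (resource manager): offer a snapshot of free resources."""
    return dict(cluster)

def framework_schedule(offer, tasks):
    """Tier 2 (framework scheduler): first-fit match of (name, cores)
    tasks against the resources offered by tier 1."""
    placements = {}
    for name, cores in tasks:
        for node, free in offer.items():
            if free >= cores:
                offer[node] = free - cores
                placements[name] = node
                break
    return placements

print(framework_schedule(make_offer(), [("task-a", 3), ("task-b", 6)]))
# task-a fits on node1; task-b only fits on node2
```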

Disadvantages

  • Each framework cannot know the real-time resource usage of the entire cluster.

  • Uses pessimistic locking, so the concurrency granularity is small.
    Taking Mesos as an example, the Mesos resource scheduler pushes all free resources to only one framework at a time; only after that framework responds with its resource decisions can resources be offered to the next framework. There is therefore effectively a global lock in the Mesos resource scheduler, which greatly limits the system's concurrency.
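The effect of that global lock can be illustrated with a toy offer cycle (node names and core counts are assumptions, not Mesos's real protocol): all free resources go to one framework, and the next framework is only served after the first responds.

```python
free = {"node1": 4, "node2": 8}  # node -> free CPU cores (assumed)

def offer_cycle(framework):
    """Serialize offers: one framework at a time sees ALL free
    resources; the next offer happens only after it responds."""
    accepted = framework(dict(free))  # framework decides what to take
    for node, cores in accepted.items():
        free[node] -= cores
    return accepted

offer_cycle(lambda offer: {"node2": 4})               # framework A takes 4 cores
offer_cycle(lambda offer: {"node2": offer["node2"]})  # framework B gets what remains
print(free)
```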

Scheduling Architecture of Shared State Scheduler#

This scheduler simplifies the centralized resource scheduling module of the two-tier scheduler into some persistent shared data (state) plus validation code for that data. The "shared data" here is essentially the real-time resource usage information of the entire cluster; typical representatives are Google's Omega, Microsoft's Apollo, and HashiCorp's Nomad container scheduler.

Once shared data is introduced, how that data is accessed concurrently becomes the core of the system design. Omega, for example, uses multi-version concurrency control (MVCC) borrowed from traditional databases, which greatly improves its concurrency.
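A hedged sketch of the idea (the state layout and commit API are invented for illustration): each scheduler reads the shared state together with a version number, computes a placement, and commits only if the version is unchanged; a losing scheduler must re-read and retry.

```python
class SharedState:
    """Toy shared cluster state with optimistic, versioned commits."""
    def __init__(self):
        self.version = 0
        self.free = {"node1": 4}  # node -> free cores (assumed)

    def try_commit(self, read_version, node, cores):
        # A commit succeeds only against an unchanged version with
        # enough free resources; otherwise the caller must retry.
        if read_version != self.version or self.free[node] < cores:
            return False
        self.free[node] -= cores
        self.version += 1
        return True

state = SharedState()
v = state.version
print(state.try_commit(v, "node1", 2))  # first scheduler commits
print(state.try_commit(v, "node1", 2))  # second holds a stale version
```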

Resource Allocation Methods#

Two resource allocation methods:

  • incremental placement
  • all-or-nothing

For example: a task requires 2GB of memory, and a node has 1GB free. If that 1GB is allocated to the task, the task must wait until the node frees another 1GB before it can run. This method is called "incremental placement", and Hadoop YARN adopts it.
If instead only nodes with more than 2GB of free memory are considered for the task, ignoring all others, the method is called "all-or-nothing", which both Mesos and Omega use.
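Under assumed node sizes, the two policies can be contrasted as follows; the function names and return shapes are illustrative only:

```python
def all_or_nothing(nodes, need):
    """Pick only a node whose free memory already covers the request."""
    for node, free in nodes.items():
        if free >= need:
            return node
    return None  # no node qualifies yet; the task keeps waiting

def incremental_placement(nodes, need):
    """Reserve the largest partial fit and wait for the remainder."""
    node = max(nodes, key=nodes.get)
    reserved = min(nodes[node], need)
    return node, reserved, need - reserved  # (node, reserved, missing)

nodes = {"node1": 1.0, "node2": 1.5}  # free memory in GB (assumed)
print(all_or_nothing(nodes, 2.0))         # None: nothing to run on yet
print(incremental_placement(nodes, 2.0))  # reserve 1.5 GB, wait for 0.5
```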

Both methods have their pros and cons. “All-or-nothing” may cause job starvation (tasks with large resource demands never receive the needed resources), while “incremental placement” can lead to resources being idle for long periods and may also cause job starvation. For instance, if a service requires 10GB of memory and a node currently has 8GB remaining, the scheduler allocates these resources to it and waits for other tasks to release 2GB. However, if other tasks run for a very long time and may not release resources in the short term, the service will not be able to run for a long time.

Summary#

Monolithic Scheduling
Managed by a central scheduler that oversees the resource information and task scheduling of the entire cluster, meaning all tasks can only be scheduled through the central scheduler.
The advantage of this architecture is that the central scheduler has resource information for every node in the cluster, enabling globally optimal scheduling. Its disadvantages are a lack of scheduling concurrency and a single-point bottleneck at the central server, which limits the supported cluster scale and service types as well as scheduling efficiency. It is therefore suitable for small-scale clusters.

Two-Tier Scheduling
Divides resource management and task scheduling into two layers. The first-tier scheduler is responsible for managing cluster resources and sending available resources to the second-tier scheduler; the second-tier scheduler receives resources sent by the first-tier scheduler and performs task scheduling.
The advantage of this architecture is that it avoids the single-point bottleneck of monolithic scheduling, supporting larger cluster scales and more service types. Its disadvantage is that the second-tier scheduler usually observes only part of the global resource information, so task matching cannot be globally optimal. It is suitable for medium-scale clusters.

Shared State Scheduling
Uses multiple schedulers, each of which can see the cluster's global resource information and schedule tasks based on it. Of the three architectures, shared state scheduling supports the largest cluster scale.

The advantage of this scheduling architecture is that each scheduler can access global resource information in the cluster, allowing the task matching algorithm to achieve global optimality. However, because each scheduler can perform task matching globally, multiple schedulers scheduling simultaneously may match the same node, leading to resource competition and conflicts.

Although Omega's paper claims that conflicts are avoided through an optimistic locking mechanism, in engineering practice resource conflicts can still occur if resource competition is not handled properly, causing task scheduling to fail. Users then need to handle the failed tasks themselves, for example by rescheduling them and maintaining scheduling state, which further increases the complexity of task scheduling.
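The user-side handling described above often amounts to a retry loop around the scheduler's commit step; `try_place` below is a hypothetical callback standing in for one optimistic placement attempt, not a real API:

```python
def schedule_with_retry(try_place, max_attempts=3):
    """Retry an optimistic placement; each attempt is expected to
    re-read the shared state. Raise if conflicts persist so the
    caller can requeue the task."""
    for attempt in range(1, max_attempts + 1):
        if try_place():
            return attempt  # number of attempts used
    raise RuntimeError("placement still conflicting; requeue the task")

# Simulate a placement that loses the race twice, then succeeds.
outcomes = iter([False, False, True])
print(schedule_with_retry(lambda: next(outcomes)))  # 3
```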

Summary Analysis in the Table Below

| Architecture | Advantage | Disadvantage | Suitable scale |
| --- | --- | --- | --- |
| Monolithic scheduling | Global resource view; globally optimal scheduling | No scheduling concurrency; central single-point bottleneck | Small clusters |
| Two-tier scheduling | No single-point bottleneck; larger scale and more service types | Second tier sees only partial resources; matching not globally optimal | Medium clusters |
| Shared state scheduling | Every scheduler sees global resources; globally optimal matching | Concurrent schedulers may conflict on the same node | Large clusters |
