Distributed Computing#
Content from:
- Geek Time Column: “Principles and Algorithm Analysis of Distributed Technology”
MapReduce Computing Model (Divide and Conquer)#
Divide and Conquer Concept#
Divide and conquer refers to breaking a complex, difficult-to-solve large problem into smaller subproblems that are simpler or directly solvable. These subproblems are independent of one another and share the same form as the original problem; they are solved recursively, and their solutions are then combined to yield the solution to the original problem.
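To make the pattern concrete, here is a minimal divide-and-conquer sketch in Java: finding the maximum of an array by splitting it in half, solving each half recursively, and combining the answers (the example problem and all names are illustrative, not from the column):

```java
// Divide and conquer: split the range in half, solve each half recursively,
// then combine the two partial answers into the answer for the whole range.
public class DivideAndConquerDemo {
    static int max(int[] a, int lo, int hi) { // works on the range [lo, hi)
        if (hi - lo == 1) {
            return a[lo];                     // base case: directly solvable
        }
        int mid = (lo + hi) / 2;              // divide
        int left = max(a, lo, mid);           // conquer the left subproblem
        int right = max(a, mid, hi);          // conquer the right subproblem
        return Math.max(left, right);         // combine the solutions
    }

    public static void main(String[] args) {
        int[] a = {3, 9, 2, 7, 5};
        System.out.println(max(a, 0, a.length)); // prints 9
    }
}
```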
Computing Process#
The entire workflow of MapReduce can be summarized in five stages: Input, Splitting, Mapping, Reducing, and Final Result.
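These stages can be simulated in a single process. Below is a sketch in Java of the classic word-count example (class and variable names are mine; a real framework such as Hadoop would distribute the splits across machines and shuffle the mapped pairs between nodes):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountDemo {
    public static void main(String[] args) {
        // Input + Splitting: the input text is broken into independent splits
        List<String> splits = Arrays.asList(
                "deer bear river", "car car river", "deer car bear");

        // Mapping: each split independently emits (word, 1) pairs
        List<SimpleEntry<String, Integer>> pairs = splits.stream()
                .flatMap(split -> Arrays.stream(split.split(" ")))
                .map(word -> new SimpleEntry<>(word, 1))
                .collect(Collectors.toList());

        // Reducing: group the pairs by word and sum the counts, producing
        // the Final Result (the grouping plays the role of the shuffle here)
        Map<String, Integer> counts = pairs.stream()
                .collect(Collectors.groupingBy(SimpleEntry::getKey,
                        Collectors.summingInt(SimpleEntry::getValue)));

        System.out.println(counts); // e.g. {river=2, car=3, bear=2, deer=2}
    }
}
```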
Fork-Join Computing Model#
Fork-Join is a multithreaded parallel processing framework provided natively by platforms such as Java, and it adopts a thread-level divide-and-conquer computing model. It exploits multi-core CPUs by recursively splitting a task into multiple small tasks and executing them in parallel on multiple processors, which is the Fork operation. Once the small tasks complete, their results are merged to obtain the result of the original task, which is the Join operation.
Fork-Join does not scale beyond a single machine and is only suitable for running inside a single Java Virtual Machine. Although the small tasks run on different processor cores, they can communicate with one another, and an idle thread can "steal" pending subtasks from other threads (work stealing).
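In Java this model is exposed through java.util.concurrent's ForkJoinPool and RecursiveTask. A minimal sketch (the array-summing task and the threshold value are illustrative):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sums a long array by recursively forking the range into halves.
class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // below this, compute directly
    private final long[] data;
    private final int lo, hi;

    SumTask(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {            // small enough: solve directly
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                           // Fork: schedule the left half asynchronously
        long rightSum = right.compute();       // compute the right half in this thread
        return left.join() + rightSum;         // Join: merge the subtask results
    }
}

public class ForkJoinDemo {
    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        java.util.Arrays.fill(data, 1L);
        long sum = ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(sum); // prints 1000000
    }
}
```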
Stream Computing Model#
In the MapReduce model, the task process exits once its task completes, making it a short-lived task model. Starting and stopping task processes is time-consuming, however, so for stream data processing, and especially for real-time tasks with strict latency requirements, the Stream computing model is often used instead.
Stream Working Principle#
Stream computing emphasizes real-time processing: data is processed as soon as it is generated. After a piece of data is processed, it is serialized into a cache and immediately transmitted over the network to the next node for further processing, rather than waiting for the cache to fill up as in MapReduce. To guarantee real-time behavior, stream computing stores no data; data keeps flowing forward, like water.
Computing Steps#
- Submit Stream Computing Job: For stream computing jobs, the computing logic must be defined in advance; once the job is submitted, the logic cannot be changed during execution, only resubmitted.
- Load Stream Data for Stream Computing: Once a stream computing job starts, it stays in a waiting state until an event triggers it. When a small batch of data arrives, the system immediately executes the computing logic and quickly produces results.
- Continuously Output Computing Results: After the results for a small batch of data are obtained, they can be written immediately into online or batch systems without waiting for the overall computation to finish, so results are displayed in real time (see the sketch below).
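These three steps can be illustrated with a toy, single-machine stand-in for a real engine such as Flink or Storm (plain Java here; the word-counting logic and all names are illustrative): the job's logic is fixed before it starts, the job blocks waiting for events, and each event immediately produces an updated output.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A toy stream job: the computing logic (word counting) is fixed at submit
// time; the job then waits indefinitely and processes each record the moment
// it arrives, emitting an updated result immediately rather than at the end.
public class StreamJobDemo {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> source = new LinkedBlockingQueue<>();
        Map<String, Integer> counts = new HashMap<>();

        Thread job = new Thread(() -> {
            try {
                while (true) {                       // long-running: never exits on its own
                    String word = source.take();     // blocks until an event arrives
                    counts.merge(word, 1, Integer::sum);
                    System.out.println(word + " -> " + counts.get(word)); // continuous output
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // job cancelled externally
            }
        });
        job.start();

        for (String w : new String[]{"river", "car", "river"}) source.put(w);
        Thread.sleep(100);   // let the job drain the events
        job.interrupt();     // stop the demo; real jobs keep running
    }
}
```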
Actor Computing Model#
The Actor model represents a distributed parallel computing model. This model has its own set of rules that define the internal computing logic of Actors and the communication rules between multiple Actors. In the Actor model, each Actor is equivalent to a component in the system and is a basic computing unit.
The three elements of the Actor model are state, behavior, and message, encapsulated in a popular equation: Actor Model = (State + Behavior) + Message.
Actor Workflow#
When Actor A and Actor B need Actor C to execute the Function logic inside it, they each send a message to Actor C. The messages are stored in Actor C's message queue (its mailbox), and Actor C then executes Function on them in the order the messages arrived.
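A minimal mailbox-based Actor in plain Java makes this concrete (a production system would use a framework such as Akka; the names here are illustrative). The single consumer loop is what guarantees that messages are handled one at a time, in arrival order:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A toy Actor: private state, a mailbox, and a behavior loop that handles
// exactly one message at a time, so the state needs no locking.
public class ActorDemo {
    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
    private int processed = 0; // state, touched only by the actor's own thread

    public void send(String message) {  // senders never block waiting for a reply
        mailbox.offer(message);
    }

    public void start() {
        Thread loop = new Thread(() -> {
            try {
                while (true) {
                    String msg = mailbox.take();  // one message at a time, in order
                    processed++;                  // behavior: update private state
                    System.out.println("Function(" + msg + "), handled=" + processed);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        loop.setDaemon(true); // let the JVM exit when main finishes
        loop.start();
    }

    public static void main(String[] args) throws InterruptedException {
        ActorDemo actorC = new ActorDemo();
        actorC.start();
        actorC.send("message from Actor A"); // A and B just enqueue and move on
        actorC.send("message from Actor B"); // C runs Function in arrival order
        Thread.sleep(100);                   // give the demo time to print
    }
}
```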
Actor Advantages and Disadvantages Analysis#
Advantages#
- Achieves a Higher Level of Abstraction. Actors are similar to OOP objects, encapsulating state and behavior. However, Actors communicate asynchronously, so multiple Actors can run independently without interfering with one another, avoiding the resource contention that arises when OOP objects are accessed concurrently.
- Non-blocking. In the Actor model, Actors communicate asynchronously, so when one Actor sends a message to another, it does not need to wait for a response and can continue running other tasks locally after sending the message. In other words, the Actor model avoids blocking by introducing a message-passing mechanism.
- No Need for Locks. An Actor can only read one message from its Mailbox at a time, meaning it can only process one message at a time internally, serving as a natural mutex, thus eliminating the need for additional code locking.
- High Concurrency. Each Actor only needs to handle messages from its local Mailbox, allowing multiple Actors to work in parallel, thereby enhancing the overall parallel processing capability of the distributed system.
- Easily Scalable. Each Actor can create multiple Actors, reducing the workload of a single Actor. When a local Actor cannot handle the load, it can start an Actor on a remote node and forward messages there.
Disadvantages#
- The Actor model offers modularity and encapsulation but lacks inheritance and layering: even if multiple Actors share common logic or code, that part must be rewritten in each Actor. Reusability is therefore low, and a change in business logic requires rewriting large amounts of code.
- Actors can dynamically create more Actors, so the system's behavior keeps changing, which makes the model difficult to implement in engineering practice; moreover, increasing the number of Actors increases system overhead.
- The Actor model is not suitable for systems with strict requirements on message processing order. Since messages in the Actor model are asynchronous, the execution order of each message cannot be determined. Although blocking Actors can be used to solve order issues, this significantly impacts the task processing efficiency of the Actor model.
Applications of the Actor Model#
- Akka
- Quasar (Java)
- Erlang/OTP
Summary of the Actor model: a high-concurrency computing model built on asynchronous message passing, in which each Actor encapsulates its own state and behavior and processes the messages in its mailbox one at a time.
Pipeline Computing Model#
Pipelining in computer hardware splits each instruction into multiple steps so that different steps of different instructions overlap in time, thereby executing several instructions in parallel. Modern CPUs adopt this pipelined design, dividing an instruction into five stages: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), and Write Back (WB). The pipeline computing model in the distributed field is similar: a large task is split into multiple steps, and different steps may be executed by different processes.
Computing Process#
Take data preprocessing in machine learning as an example. Suppose there are 5 samples, and preprocessing each sample involves three steps that must run in order: data deduplication, handling missing values, and data normalization. The preprocessing task can therefore be split into three subtasks: data deduplication -> handling missing values -> data normalization. With three nodes, node 1 performs deduplication, node 2 handles missing values, and node 3 normalizes the data. When node 1 finishes sample 1 and sends the result to node 2, node 1 can move on to sample 2 while node 2 processes sample 1, and so on, so multiple samples are processed in parallel across the stages.
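The overlap is easy to see in a single-process sketch with one thread per stage, connected by queues (plain Java; the stage logic is stubbed out, and in a real distributed deployment each stage would run as a process on its own node, with network channels in place of the queues):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.UnaryOperator;

// Three pipeline stages (dedup -> fill missing values -> normalize), each in
// its own thread. While stage 2 works on sample 1, stage 1 can already take
// sample 2, so different steps of different samples overlap in time.
public class PipelineDemo {
    static final String END = "__END__"; // sentinel that shuts the stages down

    static void stage(String name, BlockingQueue<String> in,
                      BlockingQueue<String> out, UnaryOperator<String> step) {
        new Thread(() -> {
            try {
                for (String s = in.take(); !s.equals(END); s = in.take()) {
                    out.put(step.apply(s));  // process one sample, hand it on
                }
                out.put(END);                // propagate shutdown downstream
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name).start();
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> q1 = new LinkedBlockingQueue<>();
        BlockingQueue<String> q2 = new LinkedBlockingQueue<>();
        BlockingQueue<String> q3 = new LinkedBlockingQueue<>();
        BlockingQueue<String> done = new LinkedBlockingQueue<>();

        stage("node1-dedup",     q1, q2,   s -> s + " | dedup");
        stage("node2-fillNA",    q2, q3,   s -> s + " | fillNA");
        stage("node3-normalize", q3, done, s -> s + " | normalize");

        for (int i = 1; i <= 5; i++) q1.put("sample" + i); // the 5 samples
        q1.put(END);

        for (String r = done.take(); !r.equals(END); r = done.take()) {
            System.out.println(r); // e.g. "sample1 | dedup | fillNA | normalize"
        }
    }
}
```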
Applications of the Pipeline Computing Model#
- Machine learning pipeline tasks, such as TensorFlow
- Apache Beam (not studied in detail, but built around pipeline-processing concepts)
Summary and Expansion#
Differences Between Stream Computing and Batch Computing#
- Batch computing (e.g., MapReduce) first accumulates data and then processes it as a bounded batch; a job starts, runs to completion, and exits, so results arrive only after the whole batch is done.
- Stream computing processes each piece of data the moment it is generated; a job runs continuously, outputs results in real time, and its computing logic is fixed once the job is submitted.
What is the difference between the pipeline model and the MapReduce model, both of which involve splitting large tasks into multiple subtasks?#
- MapReduce splits a large task into smaller tasks at the task level: every subtask executes the same complete set of steps, and these subtasks can run in parallel, making it a task-parallel computing model.
- The pipeline computing model, in contrast, splits a task into multiple steps at the step level: each step executes different logic, and multiple tasks of the same type are computed in parallel by overlapping their steps, making it a data-parallel model.
Additionally, the relationships between their subtasks (steps) differ:
- In MapReduce, each subtask can execute independently without interference, and after multiple subtasks complete, their results are merged to obtain the overall task result, meaning there are no dependencies between subtasks.
- In the pipeline model, multiple subtasks have dependencies, where the output of the previous subtask serves as the input for the next subtask.
In summary, the MapReduce computing model is suitable for task-parallel scenarios, while the pipeline computing model is suitable for data-parallel processing of tasks of the same type.