Flink Basics

Content referenced from:

Apache Flink Zero-Basis Introduction (1): Basic Concept Analysis

Stateful Stream Processing

Concept

Traditional batch processing continuously collects data, divides it into batches by time, and periodically runs a batch computation over each one. This raises two problems:

Suppose batches are divided by the hour and we need to count conversions of a specific event. If a conversion starts at the 58th minute and ends at the 7th minute of the next hour, how do we compute statistics that span two batches?
The time at which data is generated and the time at which it is received are not necessarily the same: an event that occurs before event A may arrive after A. How do we handle data that is received out of order?

The ideal method is:

State mechanism: A mechanism that can accumulate and maintain state. The accumulated state represents all events received so far and affects the output.
Time mechanism: A mechanism that ensures results are computed over complete data, for example by producing output only after all data for a given time period has been received.

This is what is known as stateful stream processing.
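
As a minimal sketch of why state helps with the cross-batch problem above (plain Python, not Flink's API; the event tuples and both function names are hypothetical, chosen only for illustration), the following compares hourly batches with a counter whose state outlives any batch boundary:

```python
from collections import defaultdict

# Each event: (user_id, kind, minute_since_start); kind is "start" or "end".
# Hypothetical data: user "u1" starts converting at minute 58 and finishes
# at minute 67, i.e. the 7th minute of the next hour.
EVENTS = [
    ("u2", "start", 10),
    ("u2", "end", 20),
    ("u1", "start", 58),
    ("u1", "end", 67),
]


def count_conversions_per_batch(events, batch_minutes=60):
    """Count conversions whose start and end fall in the same batch."""
    batches = defaultdict(list)
    for user, kind, minute in events:
        batches[minute // batch_minutes].append((user, kind))
    completed = 0
    for batch in batches.values():
        started = {user for user, kind in batch if kind == "start"}
        completed += sum(1 for user, kind in batch
                         if kind == "end" and user in started)
    return completed


def count_conversions_stateful(events):
    """Count conversions using state that is carried across all events."""
    open_conversions = set()  # accumulated state: users with a pending start
    completed = 0
    for user, kind, _minute in events:
        if kind == "start":
            open_conversions.add(user)
        elif kind == "end" and user in open_conversions:
            open_conversions.remove(user)
            completed += 1
    return completed
```

On this sample data the per-batch version finds one conversion and misses the one that straddles the hour, while the stateful version finds both.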

Challenges of Stateful Stream Processing

State Fault Tolerance
State Management
Event-time Processing
Savepoints and Job Migration

State Fault Tolerance

Exactly-once fault tolerance in a simple scenario:
Data arrives as an infinite stream, but the downstream computation is a single process. To give this process exactly-once state fault tolerance, a snapshot is taken after each record is processed and the state has been updated. The snapshot records the position in the input queue together with the state that corresponds to it. As long as each completed snapshot is consistent in this sense, exactly-once is guaranteed.
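
The single-process case can be sketched as follows. This is an illustrative model in plain Python, not Flink code: `CountingProcess` and `run_with_crash` are hypothetical names, the input queue is modeled as a list, and a crash is simulated by applying a state update without snapshotting it.

```python
import copy


class CountingProcess:
    """A single process that counts words and snapshots after every record."""

    def __init__(self):
        self.state = {}          # word -> count
        self.offset = 0          # position of the next record to read
        self.snapshot = ({}, 0)  # last durable (state, offset) pair

    def process(self, log):
        """Consume the log from the current offset, snapshotting each step."""
        while self.offset < len(log):
            word = log[self.offset]
            self.state[word] = self.state.get(word, 0) + 1
            self.offset += 1
            # The snapshot pairs the state with the offset that produced it.
            self.snapshot = (copy.deepcopy(self.state), self.offset)

    def restore(self):
        """Roll back to the last snapshot, as after a crash."""
        state, offset = self.snapshot
        self.state, self.offset = copy.deepcopy(state), offset


def run_with_crash(log, crash_at):
    """Process the log, crashing once while handling log[crash_at]."""
    proc = CountingProcess()
    proc.process(log[:crash_at])  # snapshot now covers log[:crash_at]
    # Simulated crash mid-record: state is updated but never snapshotted.
    word = log[crash_at]
    proc.state[word] = proc.state.get(word, 0) + 1
    proc.restore()                # discard the partial update
    proc.process(log)             # resume from the saved offset
    return proc.state
```

Because state and offset are saved together, the record that was in flight during the crash is replayed exactly once after recovery, so the final counts match a run with no failure.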

Distributed State Fault Tolerance

In a distributed scenario, multiple nodes each modify their own local state, yet together they must produce a single globally consistent snapshot.
Fault recovery is based on the checkpoint mechanism.
Distributed snapshots are implemented as an extension of the Chandy-Lamport algorithm, which lets Flink keep producing globally consistent snapshots without interrupting computation. In outline, Flink inserts checkpoint barriers into the data stream. When an operator receives checkpoint barrier N, it saves its state and passes the barrier downstream, so that once barrier N has travelled from the data sources to the end of the computation, checkpoint N is complete. Barriers N+1, N+2, and so on follow in the same stream, so checkpoints are produced continuously without blocking processing.
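
The barrier idea can be illustrated with a small single-threaded simulation. This is a sketch under simplifying assumptions, not Flink's implementation: `Barrier`, `SumOperator`, `CountOperator`, `run_pipeline`, and `global_checkpoint` are hypothetical names, the operators form one linear chain, and barrier alignment across multiple input channels is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Barrier:
    """Marker inserted into the stream to delimit checkpoint N."""
    checkpoint_id: int


class SumOperator:
    """Keeps a running sum and emits it downstream for every record."""

    def __init__(self):
        self.total = 0
        self.snapshots = {}  # checkpoint_id -> state when the barrier arrived

    def on_item(self, item):
        if isinstance(item, Barrier):
            self.snapshots[item.checkpoint_id] = self.total
            return item  # forward the barrier unchanged
        self.total += item
        return self.total


class CountOperator:
    """Counts how many records it has seen."""

    def __init__(self):
        self.count = 0
        self.snapshots = {}

    def on_item(self, item):
        if isinstance(item, Barrier):
            self.snapshots[item.checkpoint_id] = self.count
            return item
        self.count += 1
        return item


def run_pipeline(stream, operators):
    """Push each item through the operator chain in order."""
    for item in stream:
        for op in operators:
            item = op.on_item(item)


def global_checkpoint(operators, checkpoint_id):
    """Assemble one consistent snapshot from each operator's local one."""
    return [op.snapshots[checkpoint_id] for op in operators]
```

For the stream `[1, 2, Barrier(1), 3, Barrier(2), 4]`, checkpoint 1 captures a sum of 3 and a count of 2, and checkpoint 2 captures a sum of 6 and a count of 3. Each checkpoint reflects exactly the records ahead of its barrier, and processing of later records continues without pausing.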

State Management

Flink currently supports the following state backends:

JVM heap state backend: Suitable when the amount of state is small, because state is stored directly on the JVM heap. State is read as plain Java objects, with no serialization. Serialization is still required at checkpoint time, when each operator's local state is written into the distributed snapshot.
RocksDB state backend: An out-of-core state backend. At runtime, reads from the local state backend go through disk, so state is effectively maintained on disk. The cost is that every state access requires serialization and deserialization, which can make it relatively slower.
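
The trade-off between the two backends can be made concrete with a toy model. This is only an illustration in plain Python: `HeapStateBackend`, `OutOfCoreStateBackend`, `count_events`, and the `serde_calls` counter are hypothetical, `pickle` stands in for Flink's serializers, and an in-memory dict of bytes stands in for RocksDB.

```python
import pickle


class HeapStateBackend:
    """Keeps live objects in memory; reads return the object itself."""

    def __init__(self):
        self._objects = {}
        self.serde_calls = 0  # number of serialize/deserialize operations

    def get(self, key, default=None):
        return self._objects.get(key, default)

    def put(self, key, value):
        self._objects[key] = value

    def snapshot(self):
        # Serialization happens only when a checkpoint is taken.
        self.serde_calls += 1
        return pickle.dumps(self._objects)


class OutOfCoreStateBackend:
    """Stores serialized bytes, standing in for an on-disk store."""

    def __init__(self):
        self._bytes = {}
        self.serde_calls = 0

    def get(self, key, default=None):
        raw = self._bytes.get(key)
        if raw is None:
            return default
        self.serde_calls += 1  # every successful read deserializes
        return pickle.loads(raw)

    def put(self, key, value):
        self.serde_calls += 1  # every write serializes
        self._bytes[key] = pickle.dumps(value)

    def snapshot(self):
        # Values are already stored as bytes, so the sketch just copies them.
        return dict(self._bytes)


def count_events(backend, keys):
    """Maintain a per-key counter in the given backend."""
    for key in keys:
        backend.put(key, backend.get(key, 0) + 1)
    return backend
```

Running `count_events` over the same keys with both backends gives the same counters, but only the out-of-core model pays a serialization cost on each access. The heap model pays it once, at snapshot time.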

Event-time Processing

Processing Time: The time at which an event arrives at an operator and processing begins. It can vary, for example because of network delays.
Event Time: The time at which an event actually occurred, determined by the timestamp carried in each record. It is fixed and does not change.
Ingestion Time: The time at which an event enters the Flink data stream at the source operator.

Comparison

Processing Time is simpler to handle, while Event Time is more complicated.
With Processing Time, the results (and the internal state of the stream processing application) are non-deterministic. With Event Time, Flink's guarantees make the results consistent and reproducible no matter how many times the data is replayed.
A useful rule when choosing between the two: if your application fails and has to replay from the last checkpoint or savepoint, do you need the results to be exactly the same? If so, you must use Event Time. If different results are acceptable, you can use Processing Time. A common use of Processing Time is measuring the throughput of the whole system against wall-clock time, for example how many records were processed in one real hour. That can only be done with Processing Time.
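
The reproducibility argument can be checked with a small sketch. It is plain Python rather than Flink code, it ignores watermarks and lateness handling, and the records, arrival times, and function names are hypothetical.

```python
from collections import Counter

HOUR = 3600  # window size in seconds

# (event_time_seconds, payload). "c" occurred before "b" but arrives after it.
RECORDS = [(3590, "a"), (3605, "b"), (3599, "c")]

# Wall-clock arrival times for the original run and for a later replay.
FIRST_RUN_ARRIVALS = [3595, 3610, 3612]
REPLAY_ARRIVALS = [9000, 9001, 9002]


def window_counts_event_time(records):
    """Assign each record to an hourly window by its own timestamp."""
    return Counter(event_time // HOUR for event_time, _ in records)


def window_counts_processing_time(arrival_times):
    """Assign each record to an hourly window by when it was processed."""
    return Counter(arrival // HOUR for arrival in arrival_times)
```

Event-time windowing depends only on `RECORDS`, so both runs put two records in hour 0 and one in hour 1. Processing-time windowing depends on the arrival times: the first run puts one record in hour 0 and two in hour 1, while the replay puts all three in hour 2.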

Reference Link

Flink Advanced - In-depth Analysis of Time

State Saving and Migration

State saving and migration are implemented with the Savepoint mechanism. Savepoints are similar to checkpoints; the difference is that a Savepoint is triggered manually to save the global state. For details, see:
Flink Real-time Computing - In-depth Understanding of Checkpoints and Savepoints

Basic Concepts of Flink

Streams: Streams are either bounded or unbounded. An unbounded stream is a never-ending (infinite) data stream; a bounded stream is a finite data set of fixed size. Data in an unbounded stream keeps growing over time, so computation runs continuously and has no end state. The size of a bounded stream is fixed, so computation eventually completes and reaches an end state.
State: State is the data held during computation, and it plays an important role in fault recovery and Checkpoints. Stream computing is essentially Incremental Processing, so it must continually query and maintain state. In addition, to guarantee Exactly-once semantics, data must be written into state, and persisting that state is what preserves Exactly-once when the distributed system fails or crashes. This is another value of state.
Time: Divided into Event time, Ingestion time, and Processing time. Processing an unbounded stream in Flink is a continuous process, and time is an important basis for judging whether the business state is lagging and whether data is being processed promptly.
API: The APIs are usually divided into three layers, from top to bottom: SQL / Table API, DataStream API, and ProcessFunction. All three are expressive and offer strong business abstraction, but there is a trade-off: the closer a layer is to SQL, the weaker its expressive power and the stronger its abstraction. Conversely, the ProcessFunction layer is the most expressive and allows flexible, fine-grained operations, but offers the least abstraction.

Flink Window

Count Window driven by event count
Session Window driven by session intervals
Time Window driven by time
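
The three window types can be sketched as simple grouping functions. This is an illustrative model in plain Python, not Flink's window API: the function names and sample events are hypothetical, the time window shown is a tumbling (fixed, non-overlapping) window, and a session is assumed to close when the gap between consecutive events reaches `gap`.

```python
# Each event: (timestamp, value). Hypothetical sample data.
EVENTS = [(1, "a"), (2, "b"), (7, "c"), (8, "d"), (20, "e")]


def tumbling_time_windows(events, size):
    """Group events into fixed time windows keyed by window start."""
    windows = {}
    for ts, value in events:
        start = ts - ts % size
        windows.setdefault(start, []).append(value)
    return windows


def count_windows(events, size):
    """Emit a window each time `size` events have arrived."""
    values = [value for _, value in events]
    full = len(values) - len(values) % size  # trailing events stay pending
    return [values[i:i + size] for i in range(0, full, size)]


def session_windows(events, gap):
    """Start a new window whenever the pause since the last event >= gap."""
    sessions = []
    last_ts = None
    for ts, value in sorted(events):
        if last_ts is None or ts - last_ts >= gap:
            sessions.append([])
        sessions[-1].append(value)
        last_ts = ts
    return sessions
```

With the sample events, a time window of size 5 yields windows starting at 0, 5, and 20; a count window of size 2 yields two full windows and leaves "e" pending; and a session gap of 4 yields three sessions.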