Introduction to Flume and Basic Usage#

Content reference: Official Documentation

Basic Architecture of Flume#

[Figure: Flume agent architecture (Source -> Channel -> Sink)]
External data sources send events to Flume in a specific format. When a source receives an event, it stores it in one or more channels, which hold the event until it is consumed by a sink. The sink's main job is to read events from the channel and either store them in an external storage system or forward them to the source of the next agent; on success, it removes the events from the channel.

Basic Concepts#

  • Event
  • Source
  • Channel
  • Sink
  • Agent

Event#

An Event is the basic unit of data transfer in Flume NG, analogous to a message in JMS and other messaging systems. An Event consists of a header and a body: the former is a map of key/value string pairs, the latter an arbitrary byte array.
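
As a minimal sketch of this structure, an Event can be built with the Flume SDK's EventBuilder; the "logtype" header key and the body text below are made up for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class EventExample {
    public static void main(String[] args) {
        // Header: a key/value string map ("logtype" is a hypothetical key)
        Map<String, String> headers = new HashMap<>();
        headers.put("logtype", "nginx");

        // Body: an arbitrary byte array
        Event event = EventBuilder.withBody(
                "GET /index.html 200".getBytes(StandardCharsets.UTF_8), headers);

        System.out.println(event.getHeaders());          // {logtype=nginx}
        System.out.println(new String(event.getBody())); // GET /index.html 200
    }
}
```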

Agent#

An Agent is an independent JVM process that hosts the components an event flows through: Sources, Channels, and Sinks.

Source#

The data collection component: a Source receives data from an external data source and writes it into one or more Channels. Flume ships with dozens of built-in source types, such as Avro Source, Thrift Source, Kafka Source, and JMS Source.
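
As a small sketch, an Exec Source that tails a log file into channel c1 (the command and file path are hypothetical):

```properties
# Exec source: runs a command and turns each output line into an event
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1
```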

Channel#

The Channel is the pipeline between the Source and the Sink, used to buffer events temporarily. It can be backed by memory or by a persistent file system:

  • Memory Channel: Uses memory, with the advantage of speed, but data may be lost (e.g., sudden crashes);
  • File Channel: Uses a persistent file system, with the advantage of ensuring data is not lost, but is slower.

Built-in channel types include Memory Channel, JDBC Channel, Kafka Channel, File Channel, etc.
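
For illustration, a File Channel sketch using the standard checkpointDir/dataDirs properties (the directories themselves are hypothetical):

```properties
# File channel: persists events on disk so they survive an agent crash
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
```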

Sink#

The main function of the Sink is to read Events from the Channel, store them in an external storage system or forward them to the next agent's Source, and, on success, remove those Events from the Channel. Built-in sink types include HDFS Sink, Hive Sink, HBase Sink, Avro Sink, etc.
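
A minimal HDFS Sink sketch (the path and roll settings are illustrative, not recommendations):

```properties
# HDFS sink: drains events from c1 into date-partitioned HDFS files
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 30
# Date escapes in the path need a timestamp; fall back to local time
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
```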

Flume Transactions#

When data is transmitted to the next node (usually in batches), the batch is rolled back if an exception such as a network failure occurs at the receiving node, which may lead to the data being retransmitted (retransmission, not duplication). Within a single node, if an exception occurs while the Source is writing a batch into the Channel, none of that batch is written to the Channel and the partially received data is discarded, relying on the upstream node to retransmit it.

source -> channel: put transaction
channel -> sink: take transaction

Steps of a put transaction:

  • doPut: first write the batch of data into a temporary buffer called putList.
  • doCommit: check whether the Channel has space; if so, move the data in; if not, doRollback rolls the data back to putList.

Steps of a take transaction:

  • doTake: read the data into a temporary buffer called takeList and send it on to HDFS (or whatever the destination is).
  • doCommit: determine whether the send succeeded; if so, clear the temporary buffer takeList. If not (e.g., the HDFS server crashed), doRollback returns the data from takeList to the Channel (see the API sketch below).
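
These semantics are exposed through Flume's Transaction API. Below is a minimal sketch of the put side against a standalone MemoryChannel, following the begin/commit/rollback/close pattern from the Flume developer guide:

```java
import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.Transaction;
import org.apache.flume.channel.MemoryChannel;
import org.apache.flume.conf.Configurables;
import org.apache.flume.event.EventBuilder;

public class PutTransactionExample {
    public static void main(String[] args) {
        Channel channel = new MemoryChannel();
        Configurables.configure(channel, new Context()); // apply default capacities

        Transaction txn = channel.getTransaction();
        txn.begin();
        try {
            Event event = EventBuilder.withBody("hello flume".getBytes());
            channel.put(event); // staged in the transaction's putList
            txn.commit();       // moves the staged events into the channel
        } catch (Throwable t) {
            txn.rollback();     // discards the staged events
            System.err.println("put failed: " + t);
        } finally {
            txn.close();
        }
    }
}
```

The take side is symmetric: call channel.take() inside a transaction, commit after the downstream delivery succeeds, and roll back otherwise so the events return to the channel.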

Reference link: Flume Transaction Analysis

Reliability of Flume#

When a node fails, logs can be passed on to other nodes without loss. Flume provides three levels of reliability guarantee, from strongest to weakest:

  • end-to-end: on receiving data, the agent first writes the event to disk and deletes it only after successful delivery; if delivery fails, the event can be resent.
  • Store on failure: the strategy Scribe also uses; when the receiver crashes, data is written locally and sent after the receiver recovers.
  • Best effort: data is sent to the receiver without any acknowledgment.

Deployment Types of Flume#

Single Flow#

[Figure: single flow (one agent: Source -> Channel -> Sink)]
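
A minimal single-flow configuration, along the lines of the netcat example in the Flume user guide (one agent a1 wiring a netcat source to a logger sink through a memory channel):

```properties
# One agent: netcat source -> memory channel -> logger sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

It can be started with something like `bin/flume-ng agent --conf conf --conf-file example.conf --name a1` (the file name is hypothetical).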

Multi-Agent Flow (Multiple agents connected in sequence)#

[Figure: multi-agent flow (agents chained in sequence)]

Multiple Agents can be chained in sequence, collecting from the initial data source and delivering into the final storage system. This is the simplest case; in general the number of Agents chained this way should be kept under control, since the path the data travels grows longer and, if failover is not considered, a failure of any Agent affects the collection service of the entire Flow.
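
Chaining is normally done by pairing an Avro Sink on the upstream agent with an Avro Source on the downstream one; the hostnames and ports below are hypothetical:

```properties
# Upstream agent (agent1): forwards its events to the next hop
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = collector.example.com
agent1.sinks.k1.port = 4545
agent1.sinks.k1.channel = c1

# Downstream agent (agent2): receives events from the previous hop
agent2.sources.r1.type = avro
agent2.sources.r1.bind = 0.0.0.0
agent2.sources.r1.port = 4545
agent2.sources.r1.channels = c1
```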

Consolidation (Merging flows, multiple Agents aggregating data to the same Agent)#

[Figure: consolidation (multiple agents aggregating into one agent)]

This scenario is very common, for example collecting user-behavior logs from a website. For availability the website runs as a load-balanced cluster, and each node produces user-behavior logs. An Agent can be configured on each node to collect its log data, and the Agents then aggregate the data into a single storage system, such as HDFS.
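
Concretely, every web-node agent points an Avro Sink at the same aggregator agent, and the aggregator drains everything into HDFS. A sketch of the aggregator side (agent name, port, and paths are hypothetical):

```properties
# Aggregator agent: one avro source collects from all web-node agents
collector.sources = r1
collector.channels = c1
collector.sinks = k1

collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4545
collector.sources.r1.channels = c1

# File channel so aggregated data survives a crash
collector.channels.c1.type = file

collector.sinks.k1.type = hdfs
collector.sinks.k1.hdfs.path = hdfs://namenode:8020/logs/%Y-%m-%d
collector.sinks.k1.hdfs.useLocalTimeStamp = true
collector.sinks.k1.channel = c1
```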

Multiplexing the Flow#

[Figure: multiplexing the flow (one Source fanning out to multiple Channels)]

Flume supports sending events from one Source to multiple Channels, that is, delivering the same events to multiple Sinks; this operation is called Fan Out. By default, Fan Out replicates Events to all Channels, so every Channel receives the same data. Flume also supports configuring a multiplexing selector on the Source to implement custom routing rules.

For example, when logs from syslog, Java applications, Nginx, Tomcat, and so on flow into one agent, the mixed log stream can be separated inside the agent, and a dedicated transport channel established for each type of log.
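
A sketch of such a multiplexing selector, routing on a hypothetical logtype header (the header must be set upstream, e.g. by the client or an interceptor):

```properties
# Route events to different channels based on the "logtype" header
a1.sources = r1
a1.channels = c1 c2 c3

a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = logtype
a1.sources.r1.selector.mapping.nginx = c1
a1.sources.r1.selector.mapping.tomcat = c2
a1.sources.r1.selector.default = c3

a1.sources.r1.channels = c1 c2 c3
```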

Load Balancing Function#

[Figure: load balancing (one Channel feeding a group of load-balanced Sinks)]
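
Flume provides this through a sink group whose processor type is load_balance, which spreads events from one channel across several sinks; a minimal sketch:

```properties
# Sink group g1 load-balances events across sinks k1 and k2
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
```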
