Hadoop Series - HDFS

1. HDFS Startup Process

  1. Load file metadata
  2. Load log files
  3. Set checkpoints
  4. Enter safe mode, whose purpose is to check each block's replication level and confirm that redundancy meets the configured requirement.

2. HDFS Operating Mechanism

User files are split into blocks and stored across multiple DataNode servers, with each file having multiple replicas throughout the cluster, which enhances data security.
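As a concrete illustration of this layout, a file's block locations can be listed through the standard FileSystem API. A minimal sketch; the path demo.txt is just a placeholder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("demo.txt"));
        // One BlockLocation per block; each lists the DataNodes holding a replica.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + loc.getOffset()
                    + " length=" + loc.getLength()
                    + " hosts=" + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}
```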

3. Basic Architecture

  • NameNode: The management node of the entire file system, responsible for recording how files are split into Block data blocks and keeping track of the storage information of these data blocks.

    1. fsimage: the on-disk image file of the file system metadata.
    2. edits: the log of operations performed on the namespace.
    3. fstime: the time of the last checkpoint.
    4. seen_txid: the transaction ID of the last edit.
    5. VERSION: identifying information such as the namespace ID and the HDFS layout version.
  • Secondary NameNode: An auxiliary background program (not a disaster recovery node for NameNode) that communicates with the NameNode to periodically save snapshots of HDFS metadata.

  • DataNode: Data nodes responsible for reading and writing HDFS data blocks to the local file system.

HDFS is not suitable for storing small files because each file generates metadata, and as the number of small files increases, the metadata also increases, putting pressure on the NameNode.

Federated HDFS

Each NameNode maintains a namespace, and the namespaces of different NameNodes are independent of each other. Each DataNode needs to register with each NameNode.

  • Multiple NameNodes share the storage resources of a single cluster's DataNodes, and each NameNode can provide services independently.

  • Each NameNode defines a block pool (storage pool) with a separate ID, and each DataNode provides storage for all block pools.

  • DataNodes report block information to the corresponding NameNode according to the block pool ID, while also reporting their locally available storage resources to every NameNode.

  • If a client needs convenient access to resources on several NameNodes, a client-side mount table can map different directories to different NameNodes; the mapped directories must actually exist on the target NameNodes (see the ViewFs sketch below).
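For illustration, such a client-side mount table can be expressed through ViewFs settings. A minimal sketch, assuming a mount table named ClusterX and placeholder NameNode addresses (both hypothetical, not from the original text):

```java
import org.apache.hadoop.conf.Configuration;

public class ViewFsMountTable {
    public static Configuration mountTableConf() {
        Configuration conf = new Configuration();
        // Route the client through the viewfs mount table named "ClusterX".
        conf.set("fs.defaultFS", "viewfs://ClusterX/");
        // Map /user and /data to different NameNode namespaces (hypothetical hosts).
        conf.set("fs.viewfs.mounttable.ClusterX.link./user",
                 "hdfs://nn1.example.com:8020/user");
        conf.set("fs.viewfs.mounttable.ClusterX.link./data",
                 "hdfs://nn2.example.com:8020/data");
        return conf;
    }
}
```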

4. NameNode Working Mechanism

(Figure: NameNode working mechanism)
The checkpoint flow between the NameNode (NN) and Secondary NameNode (SN):

  1. HDFS reads and writes are recorded in the rolling edits log.
  2. SN periodically asks NN whether a checkpoint is needed.
  3. A checkpoint is triggered once the timer expires (60 minutes) or the edits file fills up.
  4. SN requests the checkpoint; NN ships its edits file and fsimage file to SN.
  5. SN merges the edits log into the fsimage and sends the merged fsimage back to NN.
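The two checkpoint triggers above correspond to standard HDFS settings. A sketch of setting them programmatically; the values shown are the usual Hadoop 2.x defaults:

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointSettings {
    public static Configuration checkpointConf() {
        Configuration conf = new Configuration();
        // Checkpoint at most every 3600 s (the "60 minutes" above).
        conf.setLong("dfs.namenode.checkpoint.period", 3600);
        // ...or earlier, once this many uncheckpointed transactions accumulate in edits.
        conf.setLong("dfs.namenode.checkpoint.txns", 1000000);
        // How often the Secondary NameNode polls the NameNode for the transaction count.
        conf.setLong("dfs.namenode.checkpoint.check.period", 60);
        return conf;
    }
}
```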

5. DataNode Working Mechanism

(Figure: DataNode working mechanism)

  1. DataNode starts and registers with NN, then periodically reports block data information.
  2. NN and DataNode communicate through a heartbeat detection mechanism, with a heartbeat every 3 seconds. If no heartbeat is received for more than 10 minutes, the node is considered unavailable.
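The 10-minute figure follows from the usual timeout formula, timeout = 2 × dfs.namenode.heartbeat.recheck-interval + 10 × dfs.heartbeat.interval, which with the defaults (5 minutes and 3 seconds) gives 10 minutes 30 seconds. A quick arithmetic sketch:

```java
public class HeartbeatTimeout {
    public static void main(String[] args) {
        long recheckIntervalMs = 5 * 60 * 1000; // dfs.namenode.heartbeat.recheck-interval (default 300000 ms)
        long heartbeatIntervalS = 3;            // dfs.heartbeat.interval (default 3 s)
        // Dead-node timeout applied by the NameNode:
        long timeoutMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalS * 1000;
        System.out.println("DataNode considered dead after " + timeoutMs / 1000 + " s"); // 630 s
    }
}
```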

6. HDFS Data Read Process

(Figure: HDFS read process)

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Open demo.txt in HDFS and read one UTF-framed string from it.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path file = new Path("demo.txt");
FSDataInputStream inStream = fs.open(file);
String data = inStream.readUTF(); // expects DataOutput#writeUTF framing
System.out.println(data);
inStream.close();
```
  1. The client obtains a FileSystem instance (for HDFS, a DistributedFileSystem) and calls its open() method.
  2. The DistributedFileSystem requests the locations of the first batch of blocks from NN via RPC.
  3. The first two steps produce an FSDataInputStream, which wraps a DFSInputStream that manages the actual I/O with DataNodes and the NameNode.
  4. The client calls the read() method; the DFSInputStream finds the DataNode nearest to the client, connects to it, and starts reading the file's first data block, with data streaming from the DataNode to the client.
  5. When the first data block has been read completely, the DFSInputStream closes that connection and connects to the DataNode holding the next block to continue the transfer.
  6. If the DFSInputStream hits a communication error with a DataNode while reading, it tries the next DataNode that holds the block, records which DataNode failed, and skips that DataNode for the remaining blocks.
  7. After the client finishes reading, it calls close() to close the connection; the data can then be written to the local file system if needed.

7. HDFS Data Write Process

(Figure: HDFS write process)

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Create demo.txt in HDFS and write a UTF-8 string to it.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path file = new Path("demo.txt");
FSDataOutputStream outStream = fs.create(file);
outStream.write("Welcome to HDFS Java API !!!".getBytes("UTF-8"));
outStream.close();
```
  1. The client calls the create method, obtaining a file output stream (FSDataOutputStream), and requests NN to create the file.
  2. The DistributedFileSystem calls NN via RPC to create a new file entry with no associated blocks. Before creating it, NN checks whether the file already exists and whether the client has permission to create it. On success, it writes the operation to the EditLog (WAL, write-ahead log) and returns the output stream object; otherwise, it throws an IOException.
  3. The client splits the file; for example, with a 128M block size, a 300M file is split into three blocks: two of 128M and one of 44M. It then asks NN which DataNode servers the blocks should be written to.
  4. NN returns the writable DN node information, and the client establishes a pipeline connection with the DataNodes assigned by the NameNode, writing data into the output stream object.
  5. Data written through the FSDataOutputStream is split into small packets and queued in a data queue. Each packet the client writes to the first DataNode is forwarded directly to the second, third, etc., DataNode in the pipeline; distribution does not wait until a whole block or file has been written.
  6. Packets are acknowledged back along the pipeline: a packet is removed from the ack queue only after every DataNode in the pipeline has acknowledged it.
  7. After the client finishes writing, it calls the close method to close the stream.

Supplement:

  • After the client performs a write, only completed blocks are visible; a block currently being written is not visible to the client. Only by calling the sync method can the client be sure the write so far is complete and visible. close() calls sync implicitly; whether to also call it manually is a trade-off between data robustness and throughput, decided by the application.
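In the Hadoop 2.x API, the old sync() is superseded by hflush() (flush to the DataNode pipeline, making data visible to new readers) and hsync() (additionally ask DataNodes to flush to disk). A minimal sketch; the file name is a placeholder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FlushDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream out = fs.create(new Path("visible.txt"))) {
            out.write("first record\n".getBytes("UTF-8"));
            out.hflush(); // now visible to new readers, though possibly only in DataNode memory
            out.write("second record\n".getBytes("UTF-8"));
            out.hsync();  // stronger: ask each DataNode to flush the data to disk
        }                 // close() implicitly flushes and completes the file
        fs.close();
    }
}
```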

Error Handling During Writing

Error in a DN Replica Node During Writing

  1. The pipeline will be closed first.
  2. Packets that have been sent to the pipeline but not yet acknowledged will be written back to the data queue to avoid data loss.
  3. The DataNodes that are still functioning are assigned a new version number (the latest timestamp version can be obtained from the lease information held by the NameNode), so that when the faulty node recovers, its replica is deleted because of the version mismatch.
  4. The faulty node will be removed, and a normal DataNode will be selected to re-establish the pipeline and start writing data again.
  5. If replicas are insufficient, NN will create a new replica on other DataNode nodes.

Client Crash During Writing

When the client exits abnormally during a write, different replicas of the same block may be left in inconsistent states. One replica is chosen as the primary and coordinates the other data nodes. Using the lease mechanism, NN finds the minimum block length among all DN replicas holding the block and truncates every replica to that minimum length.

For detailed reference: HDFS Recovery Process 1

8. HDFS Replication Mechanism

First replica: If the client is running on a DN, it is placed on that node; otherwise (e.g., the client is outside the cluster), a DN is chosen at random.
Second replica: Placed on a DN in a different rack.
Third replica: Placed on a different DN in the same rack as the second replica.
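The replication factor itself is a per-file attribute that can be inspected and changed through the FileSystem API. A small sketch; the path is illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("demo.txt");
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("current replication: " + current);
        // Ask the NameNode to raise the replication factor; extra replicas
        // are then created asynchronously on other DataNodes.
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}
```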

9. HDFS Safe Mode

Safe mode is a working state of HDFS, where it provides a read-only view of files to clients and does not accept modifications to the namespace.

  • When NN starts, it first loads the fsimage into memory, then executes the operations recorded in the edits log. Once the mapping of file system metadata is successfully established in memory, a new fsimage and an empty edits log file are created, at which point NN operates in safe mode.
  • During this phase, NN collects information from DN and counts the data blocks for each file. When it confirms that the minimum replication condition is met, meaning a certain proportion of data blocks have reached the minimum number of replicas, it exits safe mode. If not met, it arranges for DNs to replicate the insufficiently replicated data blocks until the minimum number of replicas is reached.
  • When starting a freshly formatted HDFS, it will not enter safe mode because there are no data blocks.

To exit safe mode: hdfs dfsadmin -safemode leave
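Safe mode can also be queried programmatically. A sketch against the Hadoop 2.x DistributedFileSystem API (SAFEMODE_LEAVE would force an exit the same way):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.SafeModeAction;

public class SafeModeCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        // SAFEMODE_GET only queries the current state without changing it.
        boolean inSafeMode = dfs.setSafeMode(SafeModeAction.SAFEMODE_GET);
        System.out.println("NameNode in safe mode: " + inSafeMode);
        fs.close();
    }
}
```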

10. HA (High Availability) Mechanism

Reference: Hadoop NameNode High Availability Implementation Analysis

Basic Architecture Implementation

HDFS high availability is provided through ZooKeeper; the basic architecture:
(Figure: HDFS HA basic architecture)

  • Active NameNode and Standby NameNode: primary and backup NameNode nodes; only the Active NameNode provides services externally.

  • Shared Storage System: Stores the metadata generated during the operation of NN. The primary and backup NN synchronize metadata through the shared storage system, and the new NN can only provide services externally after confirming that the metadata is fully synchronized during the primary-backup switch.

  • Primary-Backup Switch Controller ZKFailoverController: ZKFC runs as an independent process, capable of promptly monitoring the health status of NN. When the primary NN fails, it uses the ZK cluster to achieve automatic election switching.

  • DataNode: DataNodes need to upload block information to both primary and backup NN to ensure synchronization of the mapping relationship between HDFS data blocks and DataNodes.

  • Zookeeper Cluster: Provides primary-backup election support for the primary-backup switch controller.
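On the client side, an HA pair is addressed through a logical nameservice rather than a single host. A configuration sketch in which mycluster, nn1/nn2, and the host addresses are all hypothetical placeholders:

```java
import org.apache.hadoop.conf.Configuration;

public class HaClientConf {
    public static Configuration haConf() {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
        // The proxy provider hides which NameNode is currently Active from the client.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
        return conf;
    }
}
```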

Primary-Backup Switching Implementation

Reference: Hadoop NameNode High Availability Implementation Analysis

The primary-backup switching of NameNode is mainly coordinated by the three components: ZKFailoverController, HealthMonitor, and ActiveStandbyElector:

  • When the primary-backup switch controller ZKFailoverController starts, it creates the two main internal components, HealthMonitor and ActiveStandbyElector. While creating these components, ZKFailoverController also registers corresponding callback methods with them.

  • HealthMonitor is mainly responsible for detecting the health status of the NameNode. If it detects a change in the NameNode's status, it will call back the corresponding method of ZKFailoverController for automatic primary-backup election.

  • ActiveStandbyElector is mainly responsible for completing the automatic primary-backup election, encapsulating the processing logic of Zookeeper internally. Once the Zookeeper primary-backup election is completed, it will call back the corresponding method of ZKFailoverController to switch the primary-backup state of the NameNode.

(Figure: NameNode primary-backup switching process)
Process Analysis:

  1. After HealthMonitor initialization is complete, it will start an internal thread to periodically call the HAServiceProtocol RPC interface methods of the corresponding NameNode to check the health status of the NameNode.
  2. When a change in NN status is detected, it calls back the corresponding method of ZKFailoverController for processing.
  3. When ZKFailoverController detects that a primary-backup switch is needed, it uses ActiveStandbyElector to handle it.
  4. ActiveStandbyElector interacts with ZK to complete the automatic election, then calls back the corresponding method of ZKFailoverController to notify the current NN.
  5. ZKFailoverController calls the corresponding HAServiceProtocol RPC interface methods of the NameNode to convert the NameNode to Active or Standby state.
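For reference, automatic failover through ZKFC is switched on with two settings; a sketch with placeholder ZooKeeper hosts:

```java
import org.apache.hadoop.conf.Configuration;

public class AutoFailoverConf {
    public static Configuration autoFailoverConf() {
        Configuration conf = new Configuration();
        // Let ZKFC perform NameNode elections instead of manual "hdfs haadmin" switches.
        conf.setBoolean("dfs.ha.automatic-failover.enabled", true);
        // ZooKeeper ensemble used by ActiveStandbyElector for the election.
        conf.set("ha.zookeeper.quorum",
                 "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");
        return conf;
    }
}
```

With these in place, ZKFC drives the election flow described above without operator intervention.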