In this article I will discuss the different components of the Hadoop Distributed File System, or HDFS. By definition, HDFS is a distributed file system that spans an array of commodity hardware. It is designed to tolerate faults and to deliver high streaming throughput, so that data can be accessed efficiently and reliably. Its architecture is well suited to storing large volumes of data and processing them rapidly. HDFS is a part of the Apache Hadoop ecosystem.
Apache Hadoop is an open-source software framework. It is used in applications that require storing and processing large-scale data sets on a cluster of commodity hardware. Hadoop is licensed under the Apache License 2.0.
The Apache Hadoop framework is composed of the following modules:
- Hadoop Common – The common module contains libraries and utilities which are required by other modules of Hadoop.
- Hadoop Distributed File System (HDFS) – This is the distributed file system which stores data on the commodity machines. It is the core of the Hadoop framework and provides a very high aggregate bandwidth across the cluster.
- Hadoop YARN – This is the resource-management platform responsible for managing compute resources across the cluster and scheduling users' applications on them.
- Hadoop MapReduce – This is the programming model used for large scale data processing.
All of these modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines or whole racks of machines) are common and should be handled automatically in software by the framework. Apache Hadoop's MapReduce and HDFS components were originally derived from Google's MapReduce and Google File System (GFS), respectively.
The Hadoop Distributed File System (HDFS) is a distributed file system that runs on commodity hardware. It is similar to existing distributed file systems in many respects, but there are significant differences. HDFS is designed to be highly fault-tolerant and to be deployed on low-cost hardware. It provides high-throughput access to application data and is well suited to applications with large data sets.
Goals and Assumptions of HDFS
The Hadoop Distributed File System is designed and developed based on certain assumptions that support its goals. The important ones are listed below:
- Failure of Hardware - Hardware failure is the norm rather than the exception. A typical HDFS instance consists of hundreds or thousands of server machines, each storing part of the file system's data. With such a huge number of components, each susceptible to failure, some components of an HDFS cluster are always non-functional. Hence fault detection and quick, automatic recovery are core architectural goals of HDFS.
- Streaming Data Access - Applications that run on HDFS access their data sets as bulk streams. They differ from the typical applications that run on general-purpose file systems: HDFS is designed for batch processing rather than interactive use. The emphasis is on high throughput of data access rather than low latency.
- Large Data Sets - Applications that use HDFS have large data sets; a typical HDFS file ranges from gigabytes to terabytes in size. HDFS is therefore tuned to support large files. It should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files in a single instance.
- Simple Coherency Model - Typical HDFS applications follow a write-once-read-many access model for files. A file, once created, written, and closed, is not changed. This assumption simplifies data coherency and enables high-throughput data access. A MapReduce-based application or a web crawler fits this model perfectly. As per the Apache notes, there is a plan to support appending writes to files in the future.
- Moving Computation is Cheaper than Moving Data - A computation requested by an application performs much better when it executes near the data it operates on, and this effect grows with the size of the data. The fewer network trips the application makes, the higher the throughput it achieves: it is better to move the computation to where the data resides than to move the data to the computation. HDFS provides interfaces that let applications move themselves closer to where the data is located.
- Portability across Heterogeneous Hardware and Software Platforms - HDFS is designed to be easily portable from one platform to another, which encourages its widespread adoption as a platform of choice for large data sets.
HDFS Federation
HDFS federation is used to scale the name service horizontally. It uses several namenodes, or namespaces, which are independent of each other: the namenodes are federated and require no coordination among themselves. The datanodes are used as common storage by all the namenodes; each datanode registers with every namenode in the cluster, sends periodic reports to all of them, and responds to their commands. A block pool is the set of blocks belonging to a single namespace, and a datanode stores blocks for all the block pools in the cluster. Each block pool is managed independently, which lets a namespace generate block IDs for new blocks without consulting the other namespaces. If one namenode fails for any reason, the datanodes keep serving the remaining namenodes.
A namespace together with its block pool is called a namespace volume. When a namespace (or its namenode) is deleted, the corresponding block pool on the datanodes is deleted as well. During a cluster upgrade, each namespace volume is upgraded as a unit.
Figure 1: An HDFS federation architecture
Benefits of HDFS Federation
HDFS federation brings several advantages, listed below:
- Scalability and Isolation – With multiple namenodes, the file system namespace can be scaled horizontally. Separate namespace volumes can be assigned to different users and categories of applications, providing isolation between them.
- Generic Storage Service – The block pool abstraction allows new file systems to be built on top of the block storage layer. New applications can run directly against block storage without going through the file system interface, and customized block pools, different from the default one, can be created.
- Simple Design – The namenodes and namespaces are independent of each other, and very little change to the existing namenode was required; each namenode remains robust. Federation is also backward compatible: existing single-namenode deployments work without any configuration changes.
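To make the block-pool idea concrete, here is a minimal Python sketch of federation. The class names and the `BP-` ID scheme are invented for illustration; they are not Hadoop APIs.

```python
# Hypothetical classes modeling federation: one block pool per namespace,
# every DataNode storing blocks for all pools, and block IDs generated
# per pool with no cross-NameNode coordination.
class FederatedDataNode:
    def __init__(self):
        self.pools = {}   # pool ID -> set of block IDs, managed per pool

    def store(self, pool_id, block_id):
        self.pools.setdefault(pool_id, set()).add(block_id)

class FederatedNameNode:
    def __init__(self, name):
        self.pool_id = "BP-" + name   # this namespace's block pool
        self._next_block = 0

    def new_block_id(self):
        # unique within this pool only; other namespaces are not consulted
        self._next_block += 1
        return self._next_block

dn = FederatedDataNode()
nn_a, nn_b = FederatedNameNode("a"), FederatedNameNode("b")
dn.store(nn_a.pool_id, nn_a.new_block_id())
dn.store(nn_b.pool_id, nn_b.new_block_id())
# both pools contain a block numbered 1, with no clash: pools are independent
print(dn.pools)   # {'BP-a': {1}, 'BP-b': {1}}
```

Note how both namenodes hand out block ID 1 without conflict, because each ID only has to be unique inside its own pool.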
Salient Features of HDFS
HDFS comes with some salient features that are of interest to many users. They are listed below:
- The Hadoop Distributed File System is a perfect match for distributed storage and distributed processing on commodity hardware. It is fault tolerant, scalable, and very simple to expand. MapReduce, well known for its simplicity and applicability to a large class of distributed applications, is an integral part of Hadoop.
- HDFS is highly configurable, but the default configuration is good enough for most applications; in general it needs tuning only for very large clusters.
- Hadoop is written in Java which is supported on almost all major platforms.
- Hadoop supports shell-like commands to interact with HDFS directly.
- The NameNode and DataNodes have built-in web servers which make it easy to check the current status of the cluster.
- New features and updates are frequently implemented in HDFS. The following list is a subset of the useful features available in HDFS:
- File permissions and authentication.
- Rack awareness: this helps to take a node's physical location into account while scheduling tasks and allocating storage.
- Safemode: an administrative mode used mainly for maintenance purposes.
- fsck: a utility used to diagnose the health of the file system and to find missing files or blocks.
- fetchdt: a utility used to fetch a DelegationToken and store it in a file on the local system.
- Rebalancer: a tool used to balance the cluster when data is unevenly distributed among DataNodes.
- Upgrade and rollback: after a software upgrade, it is possible to roll back to HDFS's state before the upgrade in case of unexpected problems.
- Secondary NameNode: performs periodic checkpoints of the namespace and helps keep the size of the file containing the log of HDFS modifications within limits at the NameNode.
- Checkpoint node: performs periodic checkpoints of the namespace and helps minimize the size of the log of HDFS changes stored at the NameNode. It replaces the role previously filled by the Secondary NameNode, though it is not yet battle-hardened. The NameNode allows multiple Checkpoint nodes simultaneously, as long as no Backup nodes are registered with the system.
- Backup node: an extension of the Checkpoint node. In addition to checkpointing, it receives a stream of edits from the NameNode and maintains its own in-memory copy of the namespace, always in sync with the active NameNode's namespace state. Only one Backup node may be registered with the NameNode at a time.
HDFS Architecture
The HDFS architecture consists of a NameNode and DataNodes. The picture above shows the basic interactions among the NameNode, the DataNodes, and the clients: the client calls the NameNode for file metadata or file modifications, and once the NameNode responds, the client performs the actual file I/O directly with the DataNodes. Let us discuss the architecture in detail.
The HDFS namespace consists of files and directories, represented on the NameNode by inodes. The inodes record attributes such as permissions, modification and access times, and namespace and disk-space quotas. The content of a file is split into large blocks, usually 128 megabytes each, though the user can set the block size as the situation demands. Each block of the file is independently replicated at multiple DataNodes; by default the data is replicated on three DataNodes, but the user can change the replication factor as needed.
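The block arithmetic above can be sketched in a few lines of Python; `BLOCK_SIZE` here assumes the common 128 MB default.

```python
# Block-count arithmetic: a file is split into fixed-size blocks, and the
# last block may be only partially full. 128 MB is the assumed default.
BLOCK_SIZE = 128 * 1024 * 1024

def num_blocks(file_size, block_size=BLOCK_SIZE):
    return -(-file_size // block_size)   # ceiling division

one_gb = 1024 ** 3
print(num_blocks(one_gb))       # 8: a 1 GB file is exactly 8 full blocks
print(num_blocks(one_gb + 1))   # 9: one extra byte costs a ninth block
```

The partially full last block matters because, as described later, HDFS stores it at its actual length rather than padding it to the nominal block size.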
Image and Journal
The inodes and the list of blocks which define the metadata of the name system are called the image. The NameNode keeps the entire namespace image in RAM. The persistent record of the image, stored in the NameNode's local file system, is called the checkpoint. The NameNode records changes to HDFS in a log called the journal.
Every transaction initiated by a client is logged in the journal, and the journal file is flushed and synced before the acknowledgment is sent back to the client. The checkpoint file is never changed by the NameNode; a new file is written whenever a checkpoint is created.
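A toy Python model of this discipline, with invented names, might look like the following; a Python list stands in for the on-disk edit log, and a dict for the in-RAM image.

```python
# Toy model: an append-only journal synced before each client ack, and
# immutable checkpoints (a fresh snapshot is written each time; old
# checkpoints are never touched).
class NameSystemLog:
    def __init__(self):
        self.state = {}        # in-RAM namespace image
        self.journal = []      # append-only edit log
        self.checkpoints = []  # each entry is a frozen snapshot

    def apply(self, path, value):
        self.state[path] = value
        self.journal.append((path, value))  # flush-and-sync happens here
        return "ack"                        # ack only after the sync

    def write_checkpoint(self):
        self.checkpoints.append(dict(self.state))  # new file, never edited
        self.journal = []                          # journal starts afresh

log = NameSystemLog()
log.apply("/a", 1)
log.apply("/b", 2)
log.write_checkpoint()
print(log.checkpoints)  # [{'/a': 1, '/b': 2}]
print(log.journal)      # [] -- emptied once the checkpoint exists
```

The key property is that a checkpoint, once written, is read-only; only the journal is ever appended to.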
For better durability, redundant copies of the checkpoint and the journal are maintained on multiple independent local volumes and at remote NFS servers. If the NameNode encounters an error while writing the journal to one of the storage directories, it excludes that directory from the list of storage directories. The NameNode automatically shuts itself down when no storage directory remains available.
The NameNode is a multithreaded system and processes requests from multiple clients simultaneously. Saving a transaction to disk can become a bottleneck, because other threads must wait until the synchronous flush-and-sync procedure initiated by one of them completes. To optimize this, the NameNode batches multiple transactions: when one of the NameNode's threads initiates a flush-and-sync operation, all transactions batched up to that point are committed in one go, and the remaining threads only need to verify that their transactions have been saved.
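The batched flush-and-sync idea can be sketched as follows; this is an illustrative simplification, not the NameNode's actual implementation.

```python
# Sketch of batched flush-and-sync: each thread logs a transaction, then
# "syncs". One sync commits every transaction batched so far; the other
# threads only verify that their entry is already durable.
import threading

class BatchedJournal:
    def __init__(self):
        self.pending = []   # transactions appended but possibly unflushed
        self.durable = 0    # how many entries have been flushed to "disk"
        self.lock = threading.Lock()

    def log_and_sync(self, txn):
        with self.lock:
            self.pending.append(txn)
            seq = len(self.pending)       # durable once self.durable >= seq
        with self.lock:
            if self.durable < seq:        # otherwise another sync covered us
                self.durable = len(self.pending)  # one flush, whole batch

journal = BatchedJournal()
threads = [threading.Thread(target=journal.log_and_sync, args=(i,))
           for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(journal.durable)   # 20: all transactions committed in few flushes
```

Under contention, a single flush can make many queued transactions durable at once, which is exactly the saving the NameNode exploits.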
A block replica on a DataNode consists of two files on the local file system: one for the data and one for the block's metadata, which includes checksums for the data and the generation stamp. The size of the data file equals the actual length of the block; no extra space is used to round it up to the nominal block size, as in traditional file systems. Hence a half-full block consumes only half a block's worth of space on the local drive.
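A rough sketch of the two-file replica layout follows. Real HDFS stores CRC32 checksums of small chunks of the block in the metadata file; this example checksums the whole block for brevity, and the function name is invented.

```python
# A replica as (data file, metadata file): the data file is exactly as
# long as the block's contents, and the metadata records a checksum and
# the generation stamp.
import zlib

def write_replica(block_bytes, generation_stamp):
    data_file = bytes(block_bytes)
    meta_file = {
        "length": len(data_file),        # no padding up to 128 MB
        "crc": zlib.crc32(data_file),
        "generation_stamp": generation_stamp,
    }
    return data_file, meta_file

data, meta = write_replica(b"hello hdfs", generation_stamp=17)
print(meta["length"])                    # 10 bytes on disk, not 128 MB
print(zlib.crc32(data) == meta["crc"])   # True: replica verifies on read
```

On read, recomputing the checksum and comparing it with the stored one detects silent corruption of the data file.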
During startup, each DataNode connects to its NameNode and performs a handshake which verifies the namespace ID and the software version of the DataNode. If either does not match, the DataNode shuts down automatically.
The namespace ID is assigned to the file system instance when it is formatted and is stored on all nodes of the cluster. Nodes with a different namespace ID are not allowed to join the cluster, which protects the integrity of the file system. A newly initialized DataNode, which has no namespace ID yet, is permitted to join the cluster and receives the cluster's namespace ID.
After the handshake, the DataNode registers with the NameNode. DataNodes store their unique storage IDs, internal identifiers that keep a DataNode recognizable even if it restarts with a different IP address or port. The storage ID is assigned to the DataNode when it registers with the NameNode for the first time and never changes after that.
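The handshake-and-registration rules can be modeled as below; the class names and the numeric IDs are hypothetical.

```python
# Handshake and registration: mismatched namespace IDs are rejected,
# fresh nodes adopt the cluster's ID, and a storage ID is assigned once.
import itertools

class ClusterNameNode:
    def __init__(self, namespace_id):
        self.namespace_id = namespace_id
        self._storage_ids = itertools.count(1)

    def handshake(self, dn):
        if dn.namespace_id is None:              # newly formatted DataNode:
            dn.namespace_id = self.namespace_id  # it adopts the cluster's ID
        if dn.namespace_id != self.namespace_id:
            return False                         # mismatch: node shuts down
        if dn.storage_id is None:                # first registration only
            dn.storage_id = next(self._storage_ids)
        return True

class ClusterDataNode:
    def __init__(self, namespace_id=None):
        self.namespace_id = namespace_id
        self.storage_id = None

nn = ClusterNameNode(namespace_id=42)
fresh = ClusterDataNode()                 # no namespace ID yet: may join
stale = ClusterDataNode(namespace_id=7)   # wrong namespace ID: rejected
print(nn.handshake(fresh), fresh.storage_id)   # True 1
print(nn.handshake(stale))                     # False
```

Because the storage ID is assigned only when `storage_id` is still unset, repeating the handshake after a restart leaves the original ID in place.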
A DataNode identifies the block replicas in its possession to the NameNode by sending a block report, which contains the block ID, the generation stamp, and the length of each block replica the server hosts. The first block report is sent immediately after DataNode registration; subsequent block reports are sent every hour and provide the NameNode with an up-to-date view of where block replicas are located in the cluster.
During normal operation, DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and that the block replicas it hosts are available. The default heartbeat interval is three seconds. If the NameNode does not receive a heartbeat from a DataNode for ten minutes, it considers the DataNode out of service and its block replicas unavailable, and it schedules the creation of new replicas of those blocks on other DataNodes.
Heartbeats from a DataNode also carry information about its total storage capacity, the fraction of storage in use, and the number of data transfers currently in progress. These statistics are used in the NameNode's block allocation and load-balancing decisions.
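The heartbeat timeout rule reduces to simple arithmetic; the constants below reflect the defaults mentioned above.

```python
# Defaults from the text: 3-second heartbeats, and ten minutes of silence
# before a DataNode is declared dead.
HEARTBEAT_INTERVAL = 3        # seconds between heartbeats
DEAD_NODE_TIMEOUT = 10 * 60   # seconds of silence before the node is dead

def is_alive(last_heartbeat, now):
    # the NameNode re-replicates the node's blocks once this returns False
    return now - last_heartbeat < DEAD_NODE_TIMEOUT

print(is_alive(last_heartbeat=0, now=30))    # True: a few missed beats are fine
print(is_alive(last_heartbeat=0, now=601))   # False: over ten minutes silent
```

Note the large gap between the two constants: many consecutive heartbeats can be lost to transient congestion before the NameNode takes the expensive step of re-replication.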
Client applications access the file system via the HDFS client, a library that exports the HDFS file system interface. Like most conventional file systems, HDFS supports operations to read, write, and delete files, and operations to create and delete directories. Clients reference files and directories by their paths in the namespace. The client application does not need to know that the file system metadata and storage reside on different servers, or that blocks have multiple replicas.
When an application reads a file, the HDFS client first asks the NameNode for the list of DataNodes that host replicas of the file's blocks; the list is sorted by network topology distance from the client. The client then contacts a DataNode directly and requests the transfer of the desired block. When a client writes, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file. The client then organizes a node-to-node pipeline and sends the data. When the first block is filled, the client requests new DataNodes to host replicas of the next block; a fresh pipeline is organized, and the client sends the further bytes of the file. The choice of DataNodes is likely to be different for each block. The interactions among the client, the NameNode, and the DataNodes are shown in the picture above.
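The "sorted by distance" step of the read path might be sketched like this; the node names and distance values are made up for illustration.

```python
# The NameNode sorts replica locations by network distance from the
# client; the client then reads from the first (nearest) DataNode.
def sort_replicas_by_distance(replicas, distance_from_client):
    return sorted(replicas, key=distance_from_client)

replicas = ["dn-remote-rack", "dn-same-rack", "dn-local"]
distance = {"dn-local": 0, "dn-same-rack": 2, "dn-remote-rack": 4}
ordered = sort_replicas_by_distance(replicas, distance.get)
print(ordered)   # ['dn-local', 'dn-same-rack', 'dn-remote-rack']
```

If the nearest replica turns out to be unreachable, the client simply moves on to the next entry in the sorted list.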
Unlike conventional file systems, HDFS provides an API that exposes the locations of a file's blocks. This allows applications such as the MapReduce framework to schedule a task near the location of the data, which improves read performance. The API also allows an application to set the replication factor of a file; by default it is three. For critical files, or files that are accessed very often, a higher replication factor is advisable, as it improves fault tolerance and increases read bandwidth.
In addition to its primary role of serving client requests, the NameNode in HDFS can alternatively execute one of two other roles, a CheckpointNode or a BackupNode. The role is specified at node startup.
The CheckpointNode periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal. It normally runs on a different host from the NameNode, since its memory requirements are of the same order as the NameNode's. The CheckpointNode downloads the current checkpoint and journal files from the NameNode, merges them locally, and returns the new checkpoint to the NameNode.
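The merge a CheckpointNode performs is essentially "load the checkpoint, then replay the journal". A minimal sketch with invented paths and values:

```python
# Merging namespace state: start from the checkpoint image, then replay
# every journal transaction in order. The result is the new checkpoint,
# and the journal can then be emptied.
def merge(checkpoint, journal):
    image = dict(checkpoint)
    for path, value in journal:   # each logged transaction, oldest first
        image[path] = value
    return image

checkpoint = {"/a": 1}
journal = [("/b", 2), ("/a", 3)]   # later edits win over earlier ones
print(merge(checkpoint, journal))  # {'/a': 3, '/b': 2}
```

Replay order matters: the journal must be applied oldest-first so that the most recent edit to each path is the one that survives.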
Creating periodic checkpoints protects the file system metadata: the system can start from the most recent checkpoint if all other persistent copies of the namespace image or journal become unavailable. Creating a checkpoint also lets the NameNode truncate the journal once the new checkpoint is uploaded. HDFS clusters run for prolonged periods without restarts, during which the journal grows constantly. The larger the journal, the higher the probability of losing or corrupting the journal file, and the longer the NameNode takes to restart; on a large cluster, processing a week-long journal can take more than an hour. The best practice is to create a checkpoint daily.
The BackupNode is a recently introduced HDFS feature. Like the CheckpointNode, it creates periodic checkpoints, but in addition it maintains an in-memory, up-to-date image of the file system namespace, always synchronized with the state of the NameNode.
The BackupNode accepts the journal stream of namespace transactions from the active NameNode, saves them in a journal in its own storage directories, and applies the transactions to its own in-memory namespace image. The NameNode treats the BackupNode as journal storage, the same way it treats the journal files in its storage directories. If the NameNode fails for any reason, the BackupNode's in-memory image and its checkpoint on disk form a record of the latest namespace state.
The BackupNode can create a checkpoint without downloading the checkpoint and journal files from the active NameNode, since it already holds an up-to-date namespace image in memory. This makes the checkpoint process on the BackupNode more efficient: it only needs to save the namespace to its local storage directories.
The BackupNode can be viewed as a read-only NameNode. It contains all file system metadata except block locations and can perform all the operations of the regular NameNode that do not involve modifying the namespace or knowing block locations. Using a BackupNode gives the option of running the NameNode without persistent storage of its own, delegating responsibility for persisting the namespace state to the BackupNode.
Upgrades and Filesystem Snapshots
Software upgrades carry a risk of corrupting data. For this reason HDFS supports snapshots, which minimize potential damage to the data stored in the system during upgrades.
The snapshot mechanism lets administrators persistently save the current state of the file system. If an upgrade leads to data loss or corruption, it is possible to roll back the upgrade and return HDFS to the namespace and storage state it had when the snapshot was taken. Only one snapshot can exist at a given time.
The snapshot is created at the cluster administrator's option when the system starts. When a snapshot is requested, the NameNode first reads the checkpoint and journal files and merges them in memory. It then writes a new checkpoint and an empty journal to a new location, so that the old checkpoint and journal remain unchanged.
During the handshake, the NameNode instructs the DataNodes whether to create a local snapshot. A local snapshot cannot be created simply by copying the directories containing the data files, since that would double the storage requirement of every DataNode in the cluster. Instead, each DataNode creates a copy of the storage directory and hard-links the existing block files into it. When the DataNode removes a block, only the hard link is removed, and block modifications during appends use a copy-on-write technique. Thus old block replicas remain untouched in their old directories.
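The hard-link trick can be demonstrated with ordinary files; this sketch assumes a POSIX file system that supports hard links, and the directory and block names are invented.

```python
# Snapshot a "block file" by hard-linking it rather than copying it: no
# extra data is stored, and deleting the live link leaves the snapshot
# copy intact.
import os
import tempfile

root = tempfile.mkdtemp()
current = os.path.join(root, "current")
snapshot = os.path.join(root, "snapshot")
os.makedirs(current)
os.makedirs(snapshot)

live = os.path.join(current, "blk_1")
with open(live, "wb") as f:
    f.write(b"block data")

os.link(live, os.path.join(snapshot, "blk_1"))  # hard link, not a copy
os.remove(live)                                 # DataNode "deletes" the block

with open(os.path.join(snapshot, "blk_1"), "rb") as f:
    print(f.read())   # the snapshot still holds the block data
```

Because both names point at the same inode, the snapshot costs no additional block storage; the data is freed only when the last link to it is removed.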
Let us conclude our discussion with the following points:
- HDFS is the distributed file system that stores data on commodity machines. It is the core of the Hadoop framework and provides very high aggregate bandwidth across the cluster.
- HDFS comes with an array of features which are well accepted in the industry. These are explained in detail above.
- The HDFS architecture is robust and capable of handling large data sets.