Hadoop Distributed File System (HDFS)

In this blog we will learn about the Apache Hadoop Distributed File System (HDFS) and its components.

The Apache Hadoop framework provides two major components:

  • Distributed Storage – The Hadoop Distributed File System (HDFS), which stores large datasets across the nodes of a cluster.

  • Distributed Processing – A framework to process large datasets in parallel using MapReduce.

HDFS

HDFS is one of the building blocks of the Hadoop ecosystem. It is designed to store very large datasets reliably while running on commodity hardware. HDFS achieves fault tolerance by splitting files into large blocks and replicating each block across several nodes, so the failure of a single machine does not cause data loss.
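As a small illustration of this design, the sketch below uses the Hadoop Java FileSystem API to print the default block size and replication factor the cluster reports. The NameNode address hdfs://localhost:9000 is an assumption for illustration; replace it with your cluster's fs.defaultFS value.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDefaults {
    public static void main(String[] args) throws Exception {
        // Assumed NameNode address; adjust to match your cluster's fs.defaultFS.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), new Configuration());
        Path root = new Path("/");
        // Large blocks (128 MB by default in Hadoop 2.x) and replication (3 by default)
        // are what make HDFS suitable for very large, fault-tolerant storage.
        System.out.println("Default block size: " + fs.getDefaultBlockSize(root) + " bytes");
        System.out.println("Default replication: " + fs.getDefaultReplication(root));
        fs.close();
    }
}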

HDFS Nodes

There are two main components of HDFS: the NameNode and the DataNode.

NameNode

HDFS follows a master-slave architecture in which the NameNode acts as the master node. An HDFS cluster has only one active NameNode. The main functionality of the NameNode is to manage the file system namespace and regulate client access to the files stored in the HDFS cluster. It also maintains the mapping from each file's blocks to the DataNodes that store them.
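To make the NameNode's role concrete, here is a minimal sketch using the Hadoop Java API. Listing a directory is a pure metadata (namespace) operation, so it is answered entirely by the NameNode without contacting any DataNode. The URI and path below are assumptions for illustration.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListRootDirectory {
    public static void main(String[] args) throws Exception {
        // Assumed NameNode address; replace with your cluster's fs.defaultFS.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), new Configuration());
        // listStatus is a namespace (metadata) request handled by the NameNode.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
        fs.close();
    }
}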

DataNode

DataNodes are the nodes which, as the name indicates, store the actual data in the cluster. There are multiple DataNodes in a cluster; usually the number of DataNodes matches the number of hardware nodes in the cluster. DataNodes serve read and write requests from clients and also handle block-level operations such as block creation, deletion, and replication, based on instructions from the NameNode.
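The sketch below, again with an assumed NameNode URI and a hypothetical target path, writes a small file. The create call asks the NameNode which DataNodes should hold the block; the bytes themselves are then streamed directly to those DataNodes, which replicate the block among themselves.

import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteSmallFile {
    public static void main(String[] args) throws Exception {
        // Assumed NameNode address and target path, for illustration only.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), new Configuration());
        try (FSDataOutputStream out = fs.create(new Path("/tmp/hello.txt"))) {
            // The data travels client -> DataNode -> replica DataNodes;
            // the NameNode only records which DataNodes hold the block.
            out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}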

Starting HDFS

To start HDFS, run the start-dfs.sh script from the sbin directory of your Hadoop installation:

/usr/home/anvinfosystem/hadoop-2.3/sbin/start-dfs.sh


This command starts the NameNode and DataNode daemons in the Hadoop cluster.
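If the cluster is configured correctly, running the jps command afterwards should list the NameNode, DataNode, and (by default) SecondaryNameNode processes, which is a quick way to confirm that HDFS is running.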
