Hadoop Distributed File System (HDFS)
In this blog we will learn about the Apache Hadoop Distributed File System (HDFS) and its components.
The Apache Hadoop framework provides two major components:
Distributed Storage – A filesystem (HDFS) that stores very large datasets across a cluster of machines.
Distributed Processing – A framework (MapReduce) to process large datasets in parallel.
HDFS is one of the building blocks of the Hadoop ecosystem. It is designed to store very large datasets reliably while running on commodity hardware: files are split into blocks, and each block is replicated across several machines, which makes the system fault-tolerant.
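The replication factor is controlled by the dfs.replication property (3 by default). As a quick check, assuming a standard installation where the hdfs command is on your PATH, you can read the configured value with the getconf subcommand:

# Print the configured replication factor (default is 3)
$ hdfs getconf -confKey dfs.replication
3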
There are two main components of HDFS: the NameNode and the DataNode.
HDFS follows a master-slave architecture in which the NameNode acts as the master node. An HDFS cluster contains exactly one NameNode. Its main functions are to manage the filesystem namespace, control client access to the files stored in the cluster, and maintain the mapping of file blocks to the DataNodes that hold them.
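Because the NameNode keeps this block-to-DataNode mapping, you can ask it to report block locations with the standard hdfs fsck tool. A small sketch, assuming the path /user/demo exists in your cluster (the path is only an example):

# List files under /user/demo with their blocks and DataNode locations
$ hdfs fsck /user/demo -files -blocks -locations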
DataNodes are the nodes which, as the name indicates, store the actual data in the cluster. There are multiple DataNodes in a cluster, usually one per hardware node. DataNodes serve read and write requests from clients and handle block-level operations such as block creation, deletion, and replication, following instructions from the NameNode.
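To see DataNodes serving a write and a read, here is a minimal sketch using the standard hdfs dfs client; the file names are hypothetical:

# Write: the client streams the file's blocks to DataNodes chosen by the NameNode
$ hdfs dfs -put sample.txt /user/demo/sample.txt
# Read: the client fetches the blocks back from the DataNodes that hold them
$ hdfs dfs -cat /user/demo/sample.txt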
Starting HDFS
To start HDFS, run the start-dfs.sh script with the following command:
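Assuming a standard installation with the HADOOP_HOME environment variable set, the script lives in the sbin directory:

$ $HADOOP_HOME/sbin/start-dfs.sh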
The command starts the NameNode and DataNode daemons in the Hadoop cluster.
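To verify that the daemons came up, you can list the running Java processes with the jps tool that ships with the JDK; the process IDs below are illustrative and will differ on your machine:

$ jps
2785 NameNode
2997 DataNode
3216 SecondaryNameNode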