As we have all heard by now, the amount of data grows exponentially every year.
Nowadays almost everyone has a mobile phone, which is a data generator in itself. Countless websites produce logs full of click events, user interactions and so on. And in recent years the Internet of Things (IoT) has appeared, where each device carries sensors that can also generate massive amounts of information.
So, as you may guess, it is a big challenge just to store this data, and an even bigger one to process it. Gaining meaningful insights from it and improving your company’s services is another discussion 🙂
In the past, when the amount of data was reasonable, you could store it in a relational database or even on a single server. Now that is almost impossible, simply because your data can grow beyond the capacity of a single server.
This article is the first post in a two-part series about HDFS. Its focus is simply to introduce you to the main HDFS concepts, such as the Namenode, Datanodes and blocks.
The second article will focus on how to use HDFS operations to handle data files, on HDFS challenges, and on its integration with other big data frameworks such as Spark, MapReduce and HBase.
I hope you have at least heard of the Apache Hadoop project. It was the first open source technology that allowed companies of any size to deploy an infrastructure ready to store and analyse data with a minimum of resources and effort.
In the beginning, Apache Hadoop had two main components: MapReduce and HDFS.
MapReduce is the component responsible for computing on the data, and HDFS for storing it. In this article we are going to talk about HDFS, describing its main features and architecture details (its integration with other big data frameworks is left for the second part). So let’s start…
HDFS stands for Hadoop Distributed File System. It is distributed because it uses many servers to store data, and it gives us a simple interface for managing data operations without worrying that behind the filesystem there are tens, hundreds or even thousands of servers.
Ordinary hard disks, the ones in our computers, are built around the concept of disk blocks: the minimum amount of data a hard disk can read or write. Filesystems designed for a single hard disk also deal with data in blocks, which are larger than disk blocks. Filesystem blocks are typically a few kilobytes in size, while disk blocks are normally 512 bytes.
Let’s see how files are represented in a traditional filesystem, for example on our laptop.
From the image above you can see that a file is represented as many small filesystem blocks, which are spread across the disk and together contain all the information in the file.
A file in HDFS looks much the same, with a few differences:
– blocks are much larger (128 MB by default).
– blocks are distributed across the hard disks of many servers.
– blocks are replicated.
– a file smaller than the HDFS block size does not occupy a full block of underlying storage.
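To get a feeling for these numbers, here is a minimal Python sketch that splits a file size into 128 MB blocks. It only does the arithmetic and does not talk to a real cluster; the function name `split_into_blocks` is made up for this illustration.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # default HDFS block size mentioned in the list above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes (in bytes) of the blocks a file of file_size bytes is cut into."""
    if file_size < 0:
        raise ValueError("file_size must not be negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only holds what is left; it is not padded to 128 MB.
        sizes.append(remainder)
    return sizes


# A 300 MB file becomes two full blocks and one smaller block.
print([size // MB for size in split_into_blocks(300 * MB)])  # [128, 128, 44]
# A 1 MB file is a single block that takes 1 MB, not 128 MB.
print([size // MB for size in split_into_blocks(1 * MB)])    # [1]
```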
HDFS is built on a master-slave architecture: a master coordinates all the operations, and many slaves store the data.
The Namenode plays the role of master in HDFS. Every time you want to perform an operation on data in HDFS, for example deleting a file or adding a new one, you first interact with the Namenode.
Let’s say we want to add a new file to HDFS, using the command “hdfs dfs -copyFromLocal /path/to/local/file /path/to/hdfs/file”. This command first asks the Namenode (NN) whether the file already exists; if it does not, the NN decides on which slaves the file will be saved.
All the information HDFS keeps about our files is called metadata. The metadata records where on the servers the blocks of our files are stored, and which permissions and owners the files have.
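To make the idea of metadata more concrete, here is a toy Python model of the bookkeeping described above. It is not how the real Namenode is implemented: the classes `FileMetadata` and `ToyNamenode`, the datanode names, the owner `alice` and the round-robin choice of one datanode per block are all assumptions made for this illustration, and replication is ignored to keep it short.

```python
from dataclasses import dataclass, field


@dataclass
class FileMetadata:
    owner: str
    permissions: str
    # block id -> names of the datanodes holding that block
    block_locations: dict = field(default_factory=dict)


class ToyNamenode:
    """Hypothetical in-memory stand-in for the Namenode's bookkeeping."""

    def __init__(self, datanodes):
        self.datanodes = list(datanodes)
        self.files = {}  # HDFS path -> FileMetadata

    def add_file(self, path, owner, permissions, num_blocks):
        # Step 1: refuse to add a file that already exists.
        if path in self.files:
            raise FileExistsError(path)
        # Step 2: decide where each block goes (assumption: simple round-robin).
        meta = FileMetadata(owner, permissions)
        for block_id in range(num_blocks):
            node = self.datanodes[block_id % len(self.datanodes)]
            meta.block_locations[block_id] = [node]
        self.files[path] = meta
        return meta


namenode = ToyNamenode(["dn1", "dn2", "dn3"])
meta = namenode.add_file("/path/to/hdfs/file", "alice", "rw-r--r--", num_blocks=4)
print(meta.block_locations)  # {0: ['dn1'], 1: ['dn2'], 2: ['dn3'], 3: ['dn1']}
```

Adding the same path a second time raises `FileExistsError`, which mirrors the "does the file already exist?" check made before anything is written.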
(Picture: Namenode and secondary Namenode.)
Datanodes play the role of slaves. Their main responsibility is to store data.
As I said above, a file in HDFS is stored as several large blocks, and each block is replicated, three times by default.
Let’s say we have a file made up of four blocks, as the image below shows.
The next image shows how the file is saved in HDFS. You can see that each block is replicated three times, on different datanodes in the cluster.
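The same layout can be sketched in a few lines of Python. This is only an illustration under my own assumptions: a cluster of five datanodes named `dn1` to `dn5` and a simple round-robin rule in a made-up function `place_replicas`. The real HDFS placement policy is more sophisticated (it takes racks into account, for example), but the key property is the same: the three copies of a block land on three different datanodes.

```python
def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes (toy round-robin rule)."""
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    placement = {}
    for block in range(1, num_blocks + 1):
        start = (block - 1) % len(datanodes)
        # Take `replication` consecutive datanodes, wrapping around the list.
        placement[block] = [
            datanodes[(start + i) % len(datanodes)] for i in range(replication)
        ]
    return placement


placement = place_replicas(4, ["dn1", "dn2", "dn3", "dn4", "dn5"])
for block, nodes in placement.items():
    print(block, nodes)
# 1 ['dn1', 'dn2', 'dn3']
# 2 ['dn2', 'dn3', 'dn4']
# 3 ['dn3', 'dn4', 'dn5']
# 4 ['dn4', 'dn5', 'dn1']
```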
HDFS guarantees data availability even if some of the datanodes are not responding or are down. This is possible simply because the blocks are replicated on multiple datanodes. In the example below, even if two datanodes are down, the cluster still has access to all four blocks: 1, 2, 3 and 4.
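You can convince yourself of this with a small Python check. The placement below is a hypothetical one (four blocks, three replicas each, five datanodes named `dn1` to `dn5`) and is not necessarily the one from the picture; `available_blocks` is a helper made up for this sketch.

```python
from itertools import combinations

# Hypothetical placement: block number -> datanodes holding a replica.
PLACEMENT = {
    1: {"dn1", "dn2", "dn3"},
    2: {"dn2", "dn3", "dn4"},
    3: {"dn3", "dn4", "dn5"},
    4: {"dn4", "dn5", "dn1"},
}


def available_blocks(placement, down):
    """Return the blocks that still have at least one replica on a live datanode."""
    down = set(down)
    return sorted(block for block, nodes in placement.items() if nodes - down)


all_nodes = sorted(set().union(*PLACEMENT.values()))

# Two datanodes down: every block is still readable.
print(available_blocks(PLACEMENT, {"dn1", "dn2"}))  # [1, 2, 3, 4]

# This holds for any pair of failed datanodes, because each block has three copies.
survives_any_two = all(
    available_blocks(PLACEMENT, pair) == [1, 2, 3, 4]
    for pair in combinations(all_nodes, 2)
)
print(survives_any_two)  # True

# Losing all three datanodes that hold block 1 makes that block unavailable.
print(available_blocks(PLACEMENT, {"dn1", "dn2", "dn3"}))  # [2, 3, 4]
```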
It is a pretty simple concept that works very well, at the cost of multiplying the volume of data you have to store.
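The cost is easy to estimate. As a rough sketch (ignoring any overhead other than replication, and using a made-up helper `raw_storage_needed`), the raw disk space is simply the size of your data multiplied by the replication factor:

```python
TB = 1024 ** 4


def raw_storage_needed(logical_bytes, replication=3):
    """Raw disk space consumed when every block is stored `replication` times."""
    return logical_bytes * replication


# 10 TB of files with the default of 3 replicas takes about 30 TB of raw disk.
print(raw_storage_needed(10 * TB) // TB)  # 30
```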
I hope you now have a high-level understanding of the main HDFS concepts. If you want to read more, I strongly recommend the HDFS chapter of Hadoop: The Definitive Guide.
If you have any comments, please don’t hesitate to send them to me 🙂