Here we are with the second part of the HDFS article. If you didn’t read the first part, you can find it here.
If in the first part of the article, I wrote about the main HDFS concepts, like blocks, datanode, namenode, now I will write about other HDFS characteristics, like file operations, HDFS challenges and its integrations with other BigData frameworks.
HDFS offers a simple API which allows us to handle data inside it. There are multiple clients implementing this API, like the WebHDFS REST API, but the simplest and more familiar for us is the Command Line Client (CLI), which offers commands very similar with the Linux commands to handle data.
Using the commands the CLI client offers us, we can do a lot of operations on our data from HDFS, from creating, deleting, listing files from directories, copying files from local filesystem to HDFS and vice versa, transferring data between HDFS clusters or changing the replication factor of files. Here is a list with more CLI command for HDFS
Let’s see how looks the command to create a directory in HDFS :
“hdfs dfs -mkdir /path_to_dir/dir_name”
“hdfs dfs” is the command used to access the operations responsible for data manipulation and “mkdir” is the same operation from linux to create a directory.
“hdfs dfs -rm -r /path_to_dir/dir_name”
Command used to delete a directory. If you want to delete a file, you just need to remove the “-r” argument.
“hdfs dfs -ls /path_to_dir/dir_name”
Like the ls command from linux, this lists the content of a directory from HDFS.
“hdfs dfs -copyFromLocal /path_to_local_file/ /path_to_dir/dir_name”
This command is used to copy data from local filesystem to HDFS.
“hdfs dfs -copyToLocal /path_to_dir/dir_name/ /path_to_local_file/”
Used to transfer data from HDFS cluster to local filesystem.
“hdfs dfs -help”
Display all the data manipulation commands which CLI client can support. I used this command when I forget the syntax of a command or want to look for the existence of new commands CLI client can offer.
The major benefits of these commands is that they handle all the complexity behind, offering us a simple and transparent interface to HDFS. They allow us to concentrate on our work, namely to do operations of files and directories, even if the HDFS cluster is made of one servers or thousands of servers.
Even for a traditional filesystem it is a challenge to handle data integrity, efficient IO operations and others. Things become even more complicated when the filesystem is deployed on tens, hundreds or even thousands of servers.
The main challenges for HDFS are listed bellow :
- disk failure
- datanode failure
- namenode failure
All of these can happen and they will happen sooner or later.
They can be caused even because of a disk or server physical problem or because the server is not powered.
Based on a google research, in the first year of disk activity, there is a chance of 1.7% of failure to over 8.6% after three years of disk activity. If you are working on an important big data project, you will work more than three years, so the chance of a disk failure is pretty big
I hope you remember the discussions from the first part of this article, that files are made of blocks and each block is replicated (default by 3) on different serves. Using block replication, HDFS can handle disk and server failure pretty easy.
The file from the image bellow is made of 4 blocks, 1,2,3,4. These blocks are distributed on a HDFS cluster made of 4 datanodes with the replication factor equal to 3.
When a client request to read this file(and all the datanodes are alive), namenode can choose block1 from datanode1, block2 from datanode3, block3 from datanode2 and block4 from datanode4. In case if two of the datanodes are down,datanode2 and datanode 4, at the same time, HDFS clients can read the file because blocks 1,2,3,4 are still available on datanode1 and datanode3.
In the beginning of HDFS, it was a single point of failure. It had only a single namenode and if that namenode was down, you were in big problems because your cluster could not be used by anyone and even worst, it could happen to lose data.
Now HDFS can with the solution of two namenodes, one running as active namenode and the other as standby namenode. In case when the first namenode is down, the standby namenode will take its place. You can read more about how this happen and how those two namenodes are sincronizing their metadata from HDFS official documentation.
Integration with other big data frameworks
As I said in the first part of this article, in the beginning Apache Hadoop contained two main components : HDFS and MapReduce. You may guess that these two components work very well together
MapReduce can work with multiple filesystems (local, S3, etc) but the default one is HDSF.
The main advantage between this good integration is data locality. Data locality refers that both the computation and data will be on the same servers.
Let’s see how this applies for MR and HDFS. Before MR starts the computation (job), it asks the HDFS on what servers are the blocks associated with the job input. After MR receives the response, it tries to send the tasks on the same servers with data. In this way, there is a minimum network data transfer. If it’s not possible to be on the same server, because the server is busy with other tasks, it sends the tasks on the servers from the same rack with data. If it’s not the case, it will send randomly on servers from other racks.
As you can see, MR tries to keep computation and data as close as possible because network data transfer is one of the major bottlenecks all the big data frameworks tries to avoid.
Apache Spark is the new cool big data processing framework which works with HDFS in the same way MR does.
Worth to mention is a little change in data locality, where Spark adds another policy. It first try to send tasks on servers which contain data in memory.
HDFS doesn’t work very well with small files and does not yet support file edits.
Because the blocks are very large,which is very good for scan operations, the random access operations are very, very poor. This is what HBase tries to resolve, random access for big data.
Apache HBase is a distributed and scalable NoSql database. Use HBase when you need random, realtime read/write access to your Big Data. The project’s goal is the hosting of very large tables, billions of rows with X millions of columns, atop clusters of clusters if commodity hardware, like HDFS.
HBase has its own filesystem atop of HDFS. The default block size for HDFS is 128 MB, which makes it very hard to use for random read/write operations. HBase solves this by making it’s blocks 64 KB by default, so the read/write operations are more efficient for granular operations.
Other important features
Bellow I just want to post another HDFS features I think are important and let you to read about them from HDFS official documentation
I hope you enjoyed reading this last article about HDFS. I tried to highlight it’s main functionalities and where it can be used in big data world.
If you want to learn more about it, I recommend to read its documentation and the HDFS chapter from Hadoop : The definitive guide book.