The first time I heard about Hadoop was at university, in the “Grid, Cluster and Cloud Computing” course, where we studied Apache Hadoop and the lab work ultimately came down to deploying Hadoop on a two-node cluster and writing a word-count job. I want to thank Adrian Darabant, the teacher of this course, for introducing me to this cutting-edge technology at Babeș-Bolyai University.
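For readers who haven’t seen it, the idea behind that word-count job is simple: a “map” phase emits a (word, 1) pair for every word, and a “reduce” phase sums the counts per word. Here is a minimal plain-Java sketch of that idea, with no Hadoop dependencies; the class and method names are mine, not from the course project:

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the word-count idea behind the classic Hadoop
// example. The map phase (emit (word, 1)) and reduce phase (sum per word)
// are collapsed into a single pass for brevity.
public class WordCount {

    // Tokenize the input on non-word characters and accumulate a count
    // per lowercase word.
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum); // reduce step: sum counts
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("to be or not to be"));
    }
}
```

In real Hadoop, the same logic is split across a `Mapper` and a `Reducer` so the framework can distribute the work over the cluster, which is what makes it scale to inputs like full Wikipedia dumps.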
First I tried to install Hadoop on an Ubuntu VM (I had Windows as my main OS). I fought with the Hadoop deployment for at least two weeks in my student hostel room. Maybe it was so difficult because it was my first ‘serious’ interaction with the command line. I encountered a lot of problems until I asked for help from one of my friends, Brandusan Bogdan, who recommended a great tutorial.
That tutorial explains very clearly and in detail how to deploy Hadoop on an Ubuntu machine. The first step was to uninstall Windows from my laptop and install Ubuntu 12.04. I followed all the steps from the tutorial and the deployment was pretty straightforward. At the last minute I managed to finish my tasks for the “Grid, Cluster and Cloud Computing” course, and that was it with Hadoop for a while.
At that time I was working as a Java EE developer on a project I didn’t find challenging, because it was mainly bugfixing :). After a while I started missing Hadoop.
I was looking to change jobs, and I decided to join Skobbler as a Java server developer. For the first two months I worked on an internal project. After finishing it, an idea came to mind when I noticed that one of my colleagues was analyzing Wikipedia dumps and a full run took almost a working day, namely 8 hours. I should mention that the project was written in Java with no parallelism at all.
Development speed is damn slow when you need to wait 8 hours to see the final results. I knew a little about Hadoop, Skobbler had the data, and in the office we had a lot of PCs. The idea was to analyze the Wikipedia dumps using Hadoop. In a few days I built a Hadoop cluster from four PCs, and after one week I managed to adapt the original project to run on Hadoop. The results were very good: from 8 hours of total execution time, our cool little cluster ran the project, on the same amount of data, in ~20 minutes.
Just imagine how much faster testing and development became from that moment on. Instead of waiting a whole day to see the results, you could see them after only 20 minutes. In one day you could run the project as many times as in two weeks the old way.
You know that this kind of thing makes a developer happier and much more productive, but just think how happy the management team would be :D. Of course, I sent an email to Philipp Kandal (Skobbler’s CTO) about the idea and its results. What his answer was… you will find out in the next article of this series.
TO BE CONTINUED…