Tudor Lapusan's Blog

Spark 2.3.0 implements vertical row display for Dataframe/Dataset

When you have a large number of columns in your Dataframe/Dataset and you want to display them all, the result is not very pretty-printed. In this short post, I will use Spark 2.3.0 (pyspark) and Apache Zeppelin. As you can see in the above image, it’s kind of hard to tell which column each value belongs to. Imagine you have far more columns; it’s even harder to understand the results. Until Spark 2.3.0, the only solution I’m aware of is to select fewer columns
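
A minimal sketch of the feature in pyspark; the toy Dataframe below is my own example, not one from the post:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("vertical-show").getOrCreate()
    df = spark.createDataFrame([(1, "Alice", 34), (2, "Bob", 45)],
                               ["id", "name", "age"])
    # vertical=True (new in 2.3.0) prints each row as a column/value list
    # instead of one wide table row
    df.show(vertical=True)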

Read the full post

My first experience with Kaggle kernels

When I’m playing on Kaggle, I usually choose python and sklearn. The usual default tool to write the code in is a jupyter notebook, but this time I decided to try kaggle kernels for the first time. It is pretty easy to create a new kernel. All you need to do is choose a competition, click the ‘Kernels’ submenu and then click ‘New Kernel’. As you can see from the above picture, I chose the well-known Titanic competition. After you click the ‘New
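
For flavor, a minimal sketch of the kind of sklearn code such a Titanic kernel might start with; the file path matches the CSVs Kaggle mounts under ../input, while the feature choice and model are purely illustrative:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier

    train = pd.read_csv("../input/train.csv")    # provided by the competition
    features = ["Pclass", "SibSp", "Parch"]      # a few numeric columns
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(train[features], train["Survived"])
    print(model.score(train[features], train["Survived"]))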

Read the full post

HDFS : The Hadoop Distributed Filesystem, part 2

Here we are with the second part of the HDFS article. If you didn’t read the first part, you can find it here. While in the first part of the article I wrote about the main HDFS concepts, like blocks, the datanode and the namenode, now I will write about other HDFS characteristics, like file operations, HDFS challenges and its integration with other BigData frameworks. HDFS operations: HDFS offers a simple API which allows us to handle the data inside it. There are
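
As a hedged sketch of what those operations look like in practice, here is the same flow driven from Python through the standard hdfs dfs shell (this assumes a running cluster with the hdfs binary on PATH; the paths and file names are illustrative):

    import subprocess

    def hdfs(*args):
        # Run a single `hdfs dfs` command, raising if it fails.
        subprocess.run(["hdfs", "dfs", *args], check=True)

    hdfs("-mkdir", "-p", "/user/tudor/demo")              # create a directory
    hdfs("-put", "local_file.txt", "/user/tudor/demo/")   # upload a local file
    hdfs("-ls", "/user/tudor/demo")                       # list its contents
    hdfs("-cat", "/user/tudor/demo/local_file.txt")       # read it back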

Read the full post

HDFS : The Hadoop Distributed Filesystem, part 1

As we all know or have heard, the amount of data grows exponentially each year. Nowadays almost every person has a mobile phone, which is a data generator; there are a lot of websites on the internet which generate a lot of logs with click events, user interactions, etc.; and in the last years the Internet of Things (IoT) appeared, where each device contains sensors which can also generate massive amounts of information. So as you may guess, there is a big

Read the full post

Perfect fit : Apache Spark, Zeppelin and Docker

The goal of this article is to show how easily you can start working with Apache Spark using Apache Zeppelin and Docker. I played with docker for the first time when Cloudera announced the new quickstart option for trying Apache Hadoop with Cloudera. It was a really nice experience and I was surprised by docker’s characteristics. For me the most powerful characteristic was the ability to share containers between users using, for example, Docker Hub. Here is the official definition of containers
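
To give a taste of how little setup this takes, a single command is enough to get Zeppelin running locally; the image name and tag below are my assumption, so check Docker Hub for the one the article actually uses:

    docker run -p 8080:8080 --name zeppelin apache/zeppelin:0.8.0
    # then open http://localhost:8080 in a browser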

Read the full post

Start working with Apache Spark

In the last year, Apache Spark has received a lot of attention in the big data and data science fields, and more and more jobs have started to appear. So, if you are reading this page you are on a good path to start working on some cool and challenging projects. In this post I’m gonna show you how easy it can be to start working with Apache Spark. All you need is a PC, an IDE (e.g. Eclipse, Idea) and basic Java knowledge. This post
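
The post itself works in Java, but as a quick illustration of how little code a first Spark program needs, here is the classic word count sketched in pyspark (the input path is a placeholder):

    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount")
    counts = (sc.textFile("input.txt")                 # placeholder path
                .flatMap(lambda line: line.split())    # line -> words
                .map(lambda word: (word, 1))           # word -> (word, 1)
                .reduceByKey(lambda a, b: a + b))      # sum counts per word
    print(counts.collect())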

Read the full post

My Hadoop story #1

Hi, my name is Tudor Lapusan and I’m passionate about BigData technologies, especially Apache Hadoop. The first time I heard about Hadoop was while I was attending university. This happened at the “Grid, Cluster and Cloud Computing” course, where we studied Apache Hadoop and all the laboratories came down in the end to deploying Hadoop on a cluster made of two nodes and writing a word-count job project. I want to thank Adrian Darabant, the teacher of this course, for introducing

Read the full post

Data serialization with Apache Avro, part 1

This article is the first post in a two-part series about data serialization with Avro in HDFS, with a focus on the benefits of having a schema associated with your data, an intro to Avro and its main characteristics. The second article will focus on data serialization using Apache Avro, with practical examples in Java and MapReduce. HDFS is a very flexible distributed storage system which lets you store any kind of data in it. If you store your data in its raw
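
As a small, hedged illustration of schema-tagged data (the record layout is my own toy example), this is roughly what writing Avro records looks like with the avro Python package; note the parse function is spelled parse() or Parse() depending on the package version:

    import avro.schema
    from avro.datafile import DataFileWriter
    from avro.io import DatumWriter

    schema = avro.schema.parse("""
    {"type": "record", "name": "User",
     "fields": [{"name": "name", "type": "string"},
                {"name": "age",  "type": "int"}]}
    """)

    with open("users.avro", "wb") as out:
        writer = DataFileWriter(out, DatumWriter(), schema)
        writer.append({"name": "Alice", "age": 34})  # checked against the schema
        writer.close()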

Read the full post

Setting Ganglia filters for Hadoop

Metrics filtering is useful when you don’t need all metrics and don’t want to overload your Ganglia cluster with useless information. First of all, I’m working with Apache Hadoop 1.1.2 (the metrics2 implementation) and Ganglia 3.1.7, and I assume you know how to integrate Hadoop with Ganglia; if not, you can read this article. Filtering can be done on three levels: source, record and metric. To apply any kind of filtering you should define the way filtering will be made
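
As a rough sketch, metrics2 filters are declared in hadoop-metrics2.properties; the GlobFilter class ships with metrics2, but the sink instance name and patterns below are only illustrative, so adapt them to your own configuration:

    # use glob patterns for source/record/metric filters
    *.source.filter.class=org.apache.hadoop.metrics2.filter.GlobFilter
    *.record.filter.class=org.apache.hadoop.metrics2.filter.GlobFilter
    *.metric.filter.class=org.apache.hadoop.metrics2.filter.GlobFilter
    # keep jvm metrics out of the Ganglia sink
    datanode.sink.ganglia.metric.filter.exclude=jvm*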

Read the full post

Hello world!

I remember when I created a “Hello World” program in C++ for the first time, then a “Wordcount” in Hadoop MapReduce, and now a “Hello World” in WordPress. I’m curious what will be next? 😉

Read the full post