Tudor Lapusan's Blog


Perfect fit: Apache Spark, Zeppelin and Docker

The goal of this article is to show how easily you can start working with Apache Spark using Apache Zeppelin and Docker.

I first played with Docker when Cloudera announced its new quickstart option for trying Apache Hadoop. It was a really nice experience and I was impressed by Docker's characteristics.

For me, the most powerful characteristic was the ability to share containers between users, for example through Docker Hub.

Here is the official definition of containers: "Docker containers wrap up a piece of software in a complete filesystem that contains everything it needs to run: code, runtime, system tools, system libraries – anything you can install on a server. This guarantees that it will always run the same, regardless of the environment it is running in."

In other words, it means that I can create a "Spark Intro" container on my laptop (Mac OS) with all the necessary dependencies: the Spark framework, Java, Python, Spark code examples and everything else it needs. When it's done, I can publish the container on Docker Hub, and anybody in the world can download it and try my Spark examples without having to install Spark or any other dependency. Even better, you don't need Mac OS; you can run the container on Windows or Linux.
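To make that workflow concrete, here is a minimal sketch of how publishing such a container looks from the command line. The image name `myuser/spark-intro` and the Dockerfile behind it are hypothetical, just to illustrate the idea:

```bash
# Build an image from a Dockerfile that installs Java, Python and Spark
# ("myuser/spark-intro" and the Dockerfile are hypothetical examples)
docker build -t myuser/spark-intro .

# Log in to Docker Hub and publish the image so anybody can pull it
docker login
docker push myuser/spark-intro
```

From that point on, anyone can grab the exact same environment with `docker pull myuser/spark-intro`, regardless of their host operating system.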

In general, environment setup is tedious and error-prone. Docker helps us skip this step and focus on what we actually want to solve and achieve. The Docker website contains a good, easy-to-follow tutorial on how to install it.

In the video below I show how to download and run a container that contains all the dependencies needed to write Spark examples.
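If you prefer to follow along in a terminal, the steps from the video boil down to a pull and a run. As an illustration, here is the same idea using the official Apache Zeppelin image from Docker Hub; the image name and tag here are assumptions for the sketch, while the video uses my own container:

```bash
# Download a Zeppelin image from Docker Hub
# (apache/zeppelin:0.9.0 is an illustrative choice; the video uses my own container)
docker pull apache/zeppelin:0.9.0

# Run it and expose Zeppelin's web UI on port 8080
docker run -p 8080:8080 --name zeppelin apache/zeppelin:0.9.0
```

Once the container is up, open http://localhost:8080 in your browser and you can start writing Spark code in Zeppelin notebooks.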

I hope you enjoyed watching the video and that you are one step closer to starting to learn Spark :)