Over the last year, Apache Spark has received a lot of attention in the big data and data science fields, and more and more Spark jobs have started to appear. So if you are reading this page, you are well on your way to working on some cool and challenging projects.
In this post I’m going to show you how easy it can be to start working with Apache Spark. All you need is a PC, an IDE (e.g. Eclipse, IDEA) and basic Java knowledge.
This post is not intended to give you detailed explanations of Spark’s main functionalities. If you want to learn more about them, I recommend the references from the “Where to learn more about Spark” section.
This post covers:
– Apache Spark history
– Where to learn more about Spark
– Spark architecture
– Easy ways to run Spark
– Supported languages
– Spark operations: transformations and actions
– Prepare your Spark local environment
Apache Spark history
Apache Spark was started in 2009 at UC Berkeley’s AMPLab by two Romanians, Matei Zaharia and Ion Stoica, and in 2013 the project was donated to the Apache Software Foundation under the Apache 2.0 license.
Spark’s goals were to provide an easier, friendlier API and better memory management than MapReduce, so developers could concentrate on the logic of the computation rather than the details of how it is executed behind the scenes.
Where to learn more about Spark
There are a lot of tutorials on the internet for learning Spark, including this one.
But if you want to dedicate more of your free time to understanding Spark’s functionality in depth, I would recommend reading the book “Learning Spark”.
It is an easy read, takes you from zero by describing all the functionalities of Spark, and contains a lot of practical examples in Java, Scala and Python.
Another good resource for learning Spark is its own public documentation. Its main advantage is that it covers the latest features of Spark. For example, at the time of writing this post, the book “Learning Spark” is based on Spark 1.3, while the latest documentation is for Spark 1.6.
If you really want to learn Spark, I recommend starting with the book and then reading about the latest features in the documentation. I prefer to start with the book because it presents the information in the right order and explains it very well.
Spark architecture
In my opinion, this is where the true value of Spark lies. Learning the basics of Spark Core gives you the ability to work on multiple use cases, such as batch processing, machine learning, SQL, stream processing and graph processing. Even better, you can combine them: for example, you can mix batch code with SQL code in the same Java class. Just imagine how many lines of Java code it would take to implement this SQL statement: “select * from user_table where location = "US" order by age desc”.
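To make that comparison concrete, here is a rough single-machine sketch of what the SQL statement above does, written by hand in plain Java (no Spark involved). The User class, its fields and the sample rows are made up for illustration, and a distributed version would need considerably more code than this.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class UserQuerySketch {

    // Hypothetical row type standing in for one record of user_table.
    static final class User {
        final String name;
        final String location;
        final int age;

        User(String name, String location, int age) {
            this.name = name;
            this.location = location;
            this.age = age;
        }
    }

    // Hand-written equivalent of:
    //   select * from user_table where location = "US" order by age desc
    static List<User> usUsersOldestFirst(List<User> userTable) {
        return userTable.stream()
                // where location = "US"
                .filter(u -> "US".equals(u.location))
                // order by age desc
                .sorted(Comparator.comparingInt((User u) -> u.age).reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample rows.
        List<User> userTable = Arrays.asList(
                new User("ana", "US", 31),
                new User("bob", "RO", 45),
                new User("carl", "US", 52));
        for (User u : usUsersOldestFirst(userTable)) {
            System.out.println(u.name + " " + u.age);
        }
    }
}
```

Even for this small in-memory case, the row type, the filter and the ordering all have to be spelled out, while the SQL version stays a single line.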
The Hadoop ecosystem is very large and contains a different framework for each particular use case: Hive for SQL, MapReduce for batch processing, Giraph for graph processing, Storm for real-time processing, Mahout for machine learning. If you want to learn or work with many of these use cases, you have to learn a new framework each time. Spark covers all of these use cases, so the learning curve for switching from batch processing (Spark Core) to real-time processing (Spark Streaming) is lower than for switching from MapReduce to Storm, for example.
Easy ways to run Spark
The first thing you might think before writing any Spark code is that you need a cluster of machines and some Linux skills to set it up, so you will most probably give up on the idea of learning or writing Spark code.
Below I list some easy ways to run your Spark code.
1. Your preferred IDE (e.g. IDEA, Eclipse)
You can write and try your code directly from your IDE, without any Spark cluster. The practical examples in this post are based on this approach.
2. Standalone deploy mode
If you need access to the functionality of a Spark cluster, you can deploy Spark on a single machine. All you need to do is download a pre-built version of Spark, unzip it and run the script sbin/start-all.sh. You can read more details in the Spark standalone mode documentation.
3. Amazon EMR
Using the EMR service, you can deploy a Spark cluster of many machines using just a web page and your mouse. Please read the Amazon docs for a complete tutorial. The only drawback is that it costs you some money.
4. Hadoop vendors
The biggest Hadoop vendors are Cloudera and Hortonworks. To help you get started with Hadoop, they offer virtual machines you can download and install on your own PC. This way you get a single-node Hadoop cluster which includes the Spark service.
5. Using Docker
You can follow this video tutorial to see how Docker and Zeppelin can help you run Spark examples (for Scala and Python).
Supported languages
Apache Spark provides native support for Scala, Java, Python and, most recently, R, so a wide range of programmers can try their hand at Spark.
Spark is written mainly in Scala and runs on the JVM, so it also performs well with Java. Python is supported and offers good performance through the smart use of Py4J.
An RDD (Resilient Distributed Dataset) is simply an immutable distributed collection of objects.
Based on the above image, we can imagine an RDD as a collection of Character (the Java type) elements, partitioned and distributed across many machines. Partitions are the unit of distribution for an RDD, like data blocks for HDFS.
The RDD is how Spark provides good data management during execution, a feature that MapReduce doesn’t offer.
We can create RDDs in two ways:
1. From an existing collection.
2. From an external data source.
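In Spark’s Java API these two paths correspond to JavaSparkContext.parallelize(...) for an existing collection and JavaSparkContext.textFile(...) for an external source. The idea of partitioning itself can be pictured with a small plain-Java sketch that cuts a local list of Character values into slices. This is only an illustration and not how Spark implements partitioning; the class PartitionSketch and the method split are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PartitionSketch {

    // Splits a local collection into numPartitions contiguous slices,
    // roughly the way a collection is cut up before being distributed.
    static <T> List<List<T>> split(List<T> elements, int numPartitions) {
        List<List<T>> partitions = new ArrayList<>();
        int size = elements.size();
        for (int i = 0; i < numPartitions; i++) {
            // Integer arithmetic keeps slice sizes within one element of each other.
            int from = (int) ((long) i * size / numPartitions);
            int to = (int) ((long) (i + 1) * size / numPartitions);
            partitions.add(new ArrayList<>(elements.subList(from, to)));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Character> letters = Arrays.asList('a', 'b', 'c', 'd', 'e', 'f', 'g');
        // On a real cluster each slice would live on a different machine.
        System.out.println(split(letters, 3));
    }
}
```

With seven letters and three partitions, the slices hold two, two and three elements; in Spark each such slice could be processed by a different machine in parallel.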
Spark operations: transformations and actions
A transformation enables you to create a new RDD from another RDD.
Below you can see a diagram which describes two of the most common transformations: map and filter.
In the case of the map transformation, Spark iterates through each element in a distributed way and increments its value by one, resulting in a new RDD, named MapRDD.
All transformations in Spark are lazy: they are only recorded, and nothing is computed until an action is called.
An action computes a result from an RDD.
An example of an action is count(), which returns the number of elements in an RDD.
The take() action returns the first n elements of the RDD (as an array in Scala, as a List in Java), and saveAsTextFile() persists the RDD elements to an external data store, like HDFS.
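The split between lazy transformations and actions can be imitated on a single machine with Java 8 streams, where an intermediate operation such as map is lazy and a terminal operation plays the role of an action. This is only an analogy in plain Java, not Spark’s API; the class LazySketch, the method incrementAll and the call counter are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazySketch {

    // Counts how many times the map function has actually run.
    static final AtomicInteger mapCalls = new AtomicInteger();

    // Describes "increment every element by one" without running it yet,
    // like declaring a map transformation.
    static Stream<Integer> incrementAll(List<Integer> input) {
        return input.stream().map(x -> {
            mapCalls.incrementAndGet();
            return x + 1;
        });
    }

    public static void main(String[] args) {
        Stream<Integer> mapped = incrementAll(Arrays.asList(1, 2, 3, 4));
        System.out.println("map calls after declaring: " + mapCalls.get());

        // The terminal operation plays the role of an action:
        // only now is the map function applied to the elements.
        List<Integer> result = mapped.collect(Collectors.toList());
        System.out.println("result: " + result);
        System.out.println("map calls after collecting: " + mapCalls.get());
    }
}
```

Running main should report zero map calls right after the pipeline is declared, and one call per element only once the terminal operation has run. Spark behaves similarly: map and filter only describe the work, and an action such as count() or take() triggers it.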
Prepare your Spark local environment
The cool thing about Spark is that you can learn and work with it on your own laptop, without the burden of creating a distributed cluster of servers. In the case of Java, all you need is an IDE (IDEA, Eclipse) and the Spark jar dependencies.
The same is true of other big data processing frameworks, like MapReduce, Storm or Flink.
You can get the code presented in my videos and other Spark examples from GitHub: https://github.com/tlapusan/SparkIntroWorkshop
I hope that you now have all you need to start learning Spark. If you have any questions or comments, please add them to the comment section below.