This article is the first post in a two-part series about data serialization with Avro in HDFS. It focuses on the benefits of associating a schema with your data, an introduction to Avro, and its main characteristics.
The second article will focus on data serialization using Apache Avro, with practical examples in Java and MapReduce.
HDFS is a very flexible distributed storage system, which lets you store any kind of data in it. If you store your data in its raw format, each of your projects (e.g. MapReduce or Spark jobs) needs to understand the structure of the data on its own.
A schema describes the structure of data. Associating a schema with your data makes the data easier to understand and helps you when its structure changes (e.g. when a field is added or removed).
Next, I’m going to give you some Big Data use cases where it is important to have a schema associated with your data.
When we have a graph of job executions, the jobs communicate through data. For example, if file1 has an associated schema, job1 and job2 can use it to understand the structure of file1. If the schema changes in the future, only a little effort is needed to adapt job1 and job2 to support both the new and the old structure. Another advantage is that jobs still using the old schema can read files written with either the old or the new structure.
If file1 had no associated schema, both job1 and job2 would have to parse its structure themselves, and a lot of work would be needed to support future changes in the file structure.
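To make the old/new structure point more concrete, here is a minimal sketch in plain Python of how a reader's schema can be applied to a record written with a different schema. It is only an illustration of the idea and does not use the Avro library: the `PageView` schema, its field names and the `resolve_record` helper are all hypothetical. The two rules it applies are that a field missing from the data is filled from the reader's default, and a field the reader does not know about is ignored.

```python
# Simplified illustration of schema resolution; this is NOT the Avro library.
# The PageView schema, its fields and resolve_record() are hypothetical names.

OLD_SCHEMA = {
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "duration_ms", "type": "long"},
    ],
}

# The new schema adds a field and gives it a default value.
NEW_SCHEMA = {
    "type": "record",
    "name": "PageView",
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "duration_ms", "type": "long"},
        {"name": "referrer", "type": "string", "default": "unknown"},
    ],
}


def resolve_record(record, reader_schema):
    """Return the record as seen through the reader's schema."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            # Field exists in the written data: keep its value.
            resolved[name] = record[name]
        elif "default" in field:
            # Field is missing from the written data: use the reader's default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    # Fields present in the data but absent from the reader schema are dropped.
    return resolved


old_record = {"url": "/home", "duration_ms": 1200}
new_record = {"url": "/cart", "duration_ms": 300, "referrer": "/home"}

new_job_reads_old = resolve_record(old_record, NEW_SCHEMA)
old_job_reads_new = resolve_record(new_record, OLD_SCHEMA)
```

In this sketch a job using the new schema sees `referrer` as `"unknown"` for old data, while a job using the old schema simply never sees the `referrer` field of new data.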
MapReduce intermediary data
Another use case is MapReduce intermediary data. A MapReduce job has two main phases, Map and Reduce. Map is the phase that reads the input data and prepares the input for the Reduce phase. The communication between Map and Reduce is also done through data, so associating a schema with this intermediary data gives us all the benefits mentioned in the use case above.
The Hadoop ecosystem has evolved a lot, and many frameworks have appeared for different data analysis use cases. With a schema associated with the data, all these frameworks can use it to understand the data structure instead of each implementing its own parser. The benefits mentioned for the first use case apply here as well.
One framework that associates a schema with the data is Apache Avro.
Apache Avro is a serialization framework optimized for Hadoop data storage and data processing. It was created to improve on Writables, the MapReduce serialization framework. Writable types (e.g. IntWritable, Text) are efficient in size and speed for data serialization, but they are hard to extend and are tied to Java, so they cannot easily be used between systems written in other languages.
Main characteristics for a serialization framework
Size – network bandwidth is often a bottleneck in a distributed system. The ability to compact your data is crucial in Big Data projects.
Speed – by using a serialization framework, you add serialization and deserialization steps to your project's execution. You want these additional steps to add as little overhead as possible.
Extensible – in many cases the structure of your data will change during the project's lifetime, for example when you need to add or remove a field. It’s very useful if the serialization framework can support both the old and the new structure of your data.
Interoperability – some systems require communication between clients and servers written in different programming languages.
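As a rough illustration of the Size point, the sketch below hand-encodes one record the way Avro's binary encoding treats longs and strings: a long is written as a zig-zag, variable-length integer, a string as its byte length followed by UTF-8 bytes, and a record is just its field values in schema order, without field names. The record and the helper names (`zigzag_varint`, `encode_string`) are assumptions made for this example, not part of any library.

```python
import json


def zigzag_varint(n):
    """Encode a signed 64-bit integer as zig-zag + variable-length bytes."""
    # Zig-zag maps small magnitudes (positive or negative) to small unsigned values.
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F          # take the low 7 bits
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow: set the high bit
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s):
    """Encode a string as its byte length (as a long) followed by UTF-8 bytes."""
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


# Hypothetical record with a string field and a long field.
record = {"url": "/home", "duration_ms": 1200}

# Field values only, in schema order; field names live in the schema, not the data.
binary = encode_string(record["url"]) + zigzag_varint(record["duration_ms"])
as_json = json.dumps(record).encode("utf-8")

print(len(binary), "bytes in binary form vs", len(as_json), "bytes as JSON text")
```

Because the field names and types are stored once in the schema instead of in every record, the per-record payload stays small.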
Other Avro characteristics
Rich data structures – with Avro we can model nested structures.
Splittable – being splittable means that a file in HDFS can be divided into multiple blocks that are processed independently, a very important aspect for data balancing and processing performance.
Compressible – applying a compression algorithm helps us reduce the space used in HDFS. Reading a compressed file can even be more efficient than reading the same file in its plain format, because less data has to be read from disk and sent over the network.
Fault-tolerant – Avro handles data corruption well compared with other serialization formats such as Protocol Buffers, Thrift or SequenceFile. If part of an Avro file is corrupted, the read can continue at the next sync point, so the failure affects only a portion of the file.
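The sync point idea can be sketched with a toy file format in plain Python. This is not Avro's real container layout (Avro writes an object count and a byte size as variable-length longs before each block and uses a randomly generated 16-byte marker per file); here each block is simply a 4-byte length, a payload and a fixed 16-byte marker, and `write_blocks` and `read_blocks` are hypothetical helpers. The reader checks that every block is followed by the marker and, when it is not, scans forward to the next marker and resumes from there.

```python
import struct

# Fixed marker for reproducibility; a real Avro file uses a random 16-byte marker.
SYNC = bytes(range(16))


def write_blocks(payloads):
    """Toy writer: each block is a 4-byte length, the payload, then the sync marker."""
    out = bytearray()
    for payload in payloads:
        out += struct.pack(">I", len(payload)) + payload + SYNC
    return bytes(out)


def read_blocks(data):
    """Toy reader: return the payloads of all blocks that can still be read."""
    good, pos = [], 0
    while pos < len(data):
        try:
            (length,) = struct.unpack_from(">I", data, pos)
        except struct.error:
            break  # not enough bytes left for a block header
        end = pos + 4 + length
        if data[end:end + len(SYNC)] == SYNC:
            # Block is followed by the marker as expected: accept it.
            good.append(data[pos + 4:end])
            pos = end + len(SYNC)
        else:
            # Corruption detected: skip ahead to the next sync point.
            nxt = data.find(SYNC, pos)
            if nxt == -1:
                break
            pos = nxt + len(SYNC)
    return good


stream = bytearray(write_blocks([b"record-1", b"record-2", b"record-3"]))

# Corrupt the length header of the second block.
second_block = 4 + len(b"record-1") + len(SYNC)
stream[second_block:second_block + 4] = b"\xff\xff\xff\xff"

recovered = read_blocks(bytes(stream))
```

In this sketch only the damaged block is lost: `recovered` holds the first and third payloads. A payload that is corrupted without breaking the block boundaries would not be detected by this toy reader.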
Avro supported languages
C, C++, C#, Java, PHP, Python, Ruby
In a non-trivial Big Data architecture, we can end up with projects written in different languages. If we go back to the first use case, the jobs chain, job2 could be written in Java, job4 in C++ and job3 in Python. With an Avro schema associated with file3, both job3 (Python) and job4 (C++) could read file3 generated by job2 (Java).
Avro supported types
Work in progress
Apache Avro is an easy framework to use and provides a lot of features, but there is still more work to be done to make it a perfect serialization framework, such as:
- Complex class hierarchies
- More data types (e.g. Date)
The fun will begin in the next article, where you will see what an Avro schema looks like, how to serialize data with Avro and what future changes in the file structure involve. All of this will be presented with practical examples in Java and MapReduce jobs. I’m still working on that article 🙂