Showing posts with label avro data serialization. Show all posts

This post is in continuation with my earlier posts on Apache Avro - Introduction and Apache Avro - Generating classes from Schema.

In this post, we will discuss about reading (deserialization) and writing(serialization) of Avro generated classes.

"Apache Avro™ is a data serialization system." We use DatumReader<T> and DatumWriter<T> for de-serialization and serialization of data, respectively.

Apache Avro formats

Apache Avro supports two formats, JSON and Binary.

Let's move to an example using JSON format.

Employee employee = Employee.newBuilder().setFirstName("Gaurav").setLastName("Mazra").setSex(SEX.MALE).build();

DatumWriter<Employee> employeeWriter = new SpecificDatumWriter<>(Employee.class);
byte[] data;
try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
  Encoder jsonEncoder = EncoderFactory.get().jsonEncoder(Employee.getClassSchema(), baos);
  employeeWriter.write(employee, jsonEncoder);
  jsonEncoder.flush();
  data = baos.toByteArray();
}
  
// serialized data
System.out.println(new String(data));
  
DatumReader<Employee> employeeReader = new SpecificDatumReader<>(Employee.class);
Decoder decoder = DecoderFactory.get().jsonDecoder(Employee.getClassSchema(), new String(data));
employee = employeeReader.read(null, decoder);
//data after deserialization
System.out.println(employee);

Explanation on the way :)

Line 1: We create an object of class Employee (AVRO generated)

Line 3: We create an object of SpecificDatumWriter<T> which implements DatumWriter<T> Also, there exists other implementation of DatumWriter viz. GenericDatumWriter and ReflectDatumWriter.

Line 6: We create JsonEncoder by passing Schema and OutputStream where we want the serialized data and In our case, it is in-memory ByteArrayOutputStream.

Line 7: We call #write method on DatumWriter with Object and Encoder.

Line 8: We flushed the JsonEncoder. Internally, it flushes the OutputStream passed to JsonEncoder.

Line 15: We created object of SpecificDatumReader<T> which implements DatumReader<T>. Also, there exists other implementation of DatumReader viz. GenericDatumReader and ReflectDatumReader.

Line 16: We create JsonDecoder passing Schema and input String which will be deserialized.

Let's move to serialization and de-serialization example with Binary format.

Employee employee = Employee.newBuilder().setFirstName("Gaurav").setLastName("Mazra").setSex(SEX.MALE).build();

DatumWriter<Employee> employeeWriter = new SpecificDatumWriter<>(Employee.class);
byte[] data;
try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
  Encoder binaryEncoder = EncoderFactory.get().binaryEncoder(baos, null);
  employeeWriter.write(employee, binaryEncoder);
  binaryEncoder.flush();
  data = baos.toByteArray();
}
  
// serialized data
System.out.println(data);
  
DatumReader<Employee> employeeReader = new SpecificDatumReader<>(Employee.class);
Decoder binaryDecoder = DecoderFactory.get().binaryDecoder(data, null);
employee = employeeReader.read(null, decoder);
//data after deserialization
System.out.println(employee);

All the example is same except Line 6 and Line 16 where we are creating an object of BinaryEncoder and BinaryDecoder.

This is how to we can serialize and deserialize data with Apache Avro. I hope you found this article informative and useful. You can find the full example on github.

In this post, we will discuss following items

  • What is Apache Avro?
  • What is Avro schema and how to define it?
  • Serialization in Apache Avro.

What is Apache Avro?

"Apache Avro is data serialization library" That's it, huh. This is what you will see when you open their official page.Apache Avro is:

  • Schema based data serialization library.
  • RPC framework (support).
  • Rich data structures (Primary includes null, string, number, boolean and Complex includes Record, Array, Map etc.).
  • A compact, fast and binary data format.

What is Avro schema and how to define it?

Apache Avro serialization concept is based on Schema. When you write data, schema is written along with it. When you read data, schema will always be present. The schema along with data makes it fully self describing.

Schema is representation of AVRO datum(Record). It is of two types: Primitive and Complex.

Primitive types

These are the basic type supported by Avro. It includes null, int, long, bytes, string, float and double. One quick example:

{"type": "string"}

Complex types

Apache Avro support six complex types i.e. record, enum, array, map, fixed and union.

RECORD

Record uses the name type 'record' and has following attributes.

  • name: a JSON string, providing the name of the record (required).
  • namespace: A JSON string that qualifies the name.
  • doc: A JSON string representing the documentation for the record.
  • aliases: A JSON array, providing alternate name for the record
  • fields: A JSON array, listing fields (required). It has own attributes.
    • name: A JSON string, providing the name of the field (required).
    • type: A JSON object, defining a schema or record definition (required).
    • doc: A JSON string, providing documentation for the field.
    • default: A default value for the field if the instance lack recognition of the field value.
{
  "type": "record",
  "name": "Node",
  "aliases": ["SinglyLinkedNodes"],
  "fields" : [
    {"name": "value", "type": "string"},
    {"name": "next", "type": ["null", "Node"]}
  ]
}
ENUM

Enum uses the type "enum" and support attributes i.e. name, namespace, aliases, doc and symbols (A JSON array).

{ 
  "type": "enum",
  "name": "Move",
  "symbols" : ["LEFT", "RIGHT", "UP", "DOWN"]
}
ARRAYS

Array uses the type "array" and support single attribute item.

{"type": "array", "items": "string"}
MAPS

Map uses the type "map" and support one attribute values. Its key by default are of type string.

{"type": "map", "values": "long"}
UNIONS

Unions are represented by JSON array as ["null", "string"] which means the value type could be null or string.

FIXED

Fixed uses type "fixed" and support two attributes i.e. name and size.

{"type": "fixed", "size": 16, "name": "md5"}

Serialization in Apache Avro

Apache Avro data is always serialized with its schema. It supports two types of encoding i.e. Binary and JSON . You can read more on serialization on their official specification and/ or can see the example usage here.